Identification of New Small RNAs and ORFs of E. coli as Mediators of Cell and Intercell Regulation

ABSTRACT

The invention relates to new small RNAs and ORFs of E. coli as mediators of cell and intercell regulation.

RELATED APPLICATIONS

This application is a divisional of U.S. application Ser. No. 11/543,601, filed Oct. 5, 2006, which is a divisional of U.S. application Ser. No. 10/627,007, filed Jul. 25, 2003 (U.S. Pat. No. 7,119,193 issued Oct. 10, 2006), which is a continuation and claims the benefit of priority of International Application No. PCT/US02/03147 filed Jan. 31, 2002, designating the United States of America and published in English, which claims the benefit of priority of U.S. Provisional Application No. 60/266,402 filed Feb. 1, 2001, all of which are hereby expressly incorporated by reference in their entireties.

Incorporated by reference herein in its entirety is the Sequence Listing filed in the parent application, U.S. patent application Ser. No. 11/543,601, filed Oct. 5, 2006 with the U.S. Patent and Trademark Office, size of 32 kilobytes.

FIELD OF THE INVENTION

The invention relates to new small RNAs and ORFs of E. coli as mediators of cell and intercell regulation.

BACKGROUND OF THE INVENTION

In the last few years, the importance of regulatory small RNAs (sRNAs) as mediators of a number of cellular processes in bacteria has begun to be recognized. Although instances of naturally occurring antisense RNAs have been known for many years, the participation of sRNAs in protein tagging for degradation, modulation of RNA polymerase activity, and stimulation of translation are relatively recent discoveries (see Wassarman, K. M. et al. 1999 Trends Microbiol 7:37-45 for review; Wassarman, K. M. and Storz, G. 2000 Cell 101:613-623). These findings have raised questions about how extensively sRNAs are used, what other cellular activities might be regulated by sRNAs, and what other mechanisms of action exist for sRNAs. In addition, prokaryotic sRNAs appear to target different cellular functions than their eukaryotic counterparts that primarily act during RNA biogenesis. It is unclear whether this difference between prokaryotic and eukaryotic sRNAs is accurate or stems from the incompleteness of current knowledge. Implicit in these questions is the question of how many sRNAs exist in a given organism and whether the current known sRNAs are truly representative of sRNA function in general.

To date, most known bacterial sRNAs have been identified fortuitously by the direct detection of highly abundant sRNAs (4.5S RNA, tmRNA, 6S RNA, RNase P RNA, and Spot42 RNA), by the observation of an sRNA during studies on proteins (OxyS RNA, Crp Tic RNA, CsrB RNA, and GcvB RNA) or by the discovery of activities associated with overexpression of genomic fragments (MicF RNA, DicF RNA, DsrA RNA, and RprA RNA) (Okamoto, K. and Freundlich, M. 1986 PNAS USA 83:5000-5004; Bhasin, R. S. 1989 Studies on the mechanism of the autoregulation of the crp operon of E. coli K12 In: Dept. of Biochemistry and Cell Biology, State University of New York at Stonybrook; Urbanowski, M. L. et al. 2000 Mol Microbiol 37:856-868; Wassarman, K. M. and Storz, G. 2000 Cell 101:613-623; Majdalani, N. et al. 2001 Mol Microbiol 39:1382-1394; for review see Wassarman, K. M. et al. 1999 Trends Microbiol 7:37-45). None of the E. coli sRNAs were found as a result of mutational screens. This observation may reflect the small target size of genes encoding sRNAs compared to protein genes, or may be a consequence of the regulatory rather than essential nature of many sRNA functions. The complete genome sequence of an organism provides a rapid inventory of most encoded proteins, tRNAs, and rRNAs, but it has not led to the immediate recognition of other genes that are not translated. In particular, new bacterial sRNA genes have been overlooked, as there are no identifiable classes of sRNAs that can be found based solely on sequence determinants.

SEGUE TO THE SUMMARY OF THE INVENTION

We and others have previously suggested several approaches to look for new sRNAs including computer searching of complete genomes based on parameters common to sRNAs, probing of genomic microarrays, and isolating sRNAs based on an association with general RNA binding proteins (Wassarman, K. M. et al. 1999 Trends Microbiol 7:37-45; Eddy, S. R. 1999 Curr Opin Genet Dev 9:695-699). Using a combination of these approaches, we have identified 17 novel sRNAs; in addition, we have found six small transcripts that contain short conserved open reading frames (ORFs).

SUMMARY OF THE INVENTION

A burgeoning list of small RNAs with a variety of regulatory functions has been identified in both prokaryotic and eukaryotic cells. However, it remains difficult to identify small RNAs by sequence inspection. We utilized the high conservation of small RNAs among closely related bacterial species, as well as analysis of transcripts detected by high-density oligonucleotide probe arrays, to predict the presence of novel small RNA genes in the intergenic regions of the Escherichia coli genome. The existence of 23 distinct new RNA species was confirmed by Northern analysis. Of these, six are predicted to encode short ORFs, whereas 17 are novel functional small RNAs. Based on the interaction of these small RNAs with the RNA binding protein Hfq, the modulation of rpoS expression, and other information, we contemplate these new small RNAs and ORFs of E. coli as mediators of cell and intercell regulation. As such, we anticipate their use in the development of diagnostics and in the development of antibiotics.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows BLAST alignments of representative Ig regions. The indicated Ig regions were used in a BLAST search of the NCBI Unfinished Microbial Genomes database. Each panel shows the summary figure provided by the BLAST program for matches to Salmonella enteritidis, Salmonella paratyphi A, Salmonella typhi, Salmonella typhimurium LT2 and Klebsiella pneumoniae. Three panels contain known sRNA genes (rprA, csrB, and oxyS), and four contain sRNA candidates (#14, #17, #52, and #36; see Table 1). For each panel, the center numbered line represents the length of the full Ig region; the orientation of flanking genes is given by > (clockwise) or < (counterclockwise). The top hatched line in each panel is the match to E. coli (full identity throughout the Ig). The other hatched or double-diagonal lines resulted from the closest matches, and the other lines indicate additional less homologous matches. Location of the conserved region with respect to the borders of the Ig region also was a criterion used for the selection of our candidates; conservation 3′ to an ORF or far from the 5′ start of an ORF was considered more likely to encode an sRNA. Note that the conservation within the Ig region encoding oxyS might be interpreted as a leader sequence based on location relative to the start of the flanking gene (oxyR). However, the conservation extends for 185 nt, and therefore candidate regions in our search in which the conservation was near the start of an ORF but was longer than 150 nt were considered further.

FIG. 2 is the expression profile across high-density oligonucleotide arrays for representative Ig regions. Probe intensities are shown for the indicated Ig regions (solid bars) and the flanking ORFs (hatched bars), calculated from the perfect match minus the mismatch intensities. All negative differences were set to zero. The data shown are for one experiment using cDNA probes, but similar results were seen in the duplicate experiment and with directly labeled RNA probes. The Ig regions and each flanking gene generally contain 15 interrogating probes. Upward bars correspond to genes transcribed on the Watson (W, clockwise) strand and downward bars correspond to genes transcribed on the Crick (C, counterclockwise) strand. The C strand signal for the CsrB Ig region corresponds well with the known location of the csrB gene. Similarly for the RprA Ig region, the W strand signal corresponds with the location of the rprA gene, but only one probe is positive. The W strand signal for #14 and the C strand signal for #17 overlap well with the conserved regions shown in the BLAST analysis in FIG. 1. #36 was chosen for further analysis because of the strong C strand signal; both flanking ORFs are on the W strand. For #52, low levels of expression were seen on both strands; the very low level for probes in the middle of the Ig on the C strand overlapped best with the conserved region found by the BLAST searches (FIG. 1).

FIG. 3 shows detection of novel sRNAs by Northern hybridization. Northern hybridization using strand specific probes for each candidate was done on RNA extracted from MG1655 cells grown under three different growth conditions: (E), exponential growth in LB medium; (M), exponential growth in M63-glucose medium, and (S) stationary phase in LB medium. Five μg of total RNA was loaded in each lane. Exposure times were optimized for each panel for visualization here; therefore the signal intensity shown does not indicate relative abundance between sRNAs. Oligonucleotide probes were used for #12, #22, #55-I, #55-II, and #61; RNA probes were used for all other panels. DNA molecular weight markers (5′-end-labeled MspI digested pBR322 DNA) were run with each set of samples for direct estimation of RNA transcript length. One lane of DNA molecular weight markers is shown for comparison, but these are approximate sizes as there was slight variation in running of gels.

FIG. 4 shows results of coimmunoprecipitation of sRNAs with the Hfq protein. (A) Immunoprecipitations using extract from MG1655 cells grown in LB medium in exponential growth (OD₆₀₀=0.4) were done using no antibody (lane 1); 5 μl of preimmune serum (lane 2); or 0.5, 1, 5, or 10 μl of hfq antisera (lanes 3-6). Selected RNAs were fractionated on a 10% polyacrylamide urea gel after 3′-end labeling. Asterisks mark RNA bands present in the anti-hfq precipitated samples but not in the preimmune control samples and therefore represent Hfq-interacting RNAs. (B) Immunoprecipitations were done using extract from MG1655 cells grown under three different growth conditions: (E) exponential growth in LB medium; (M) exponential growth in M63-glucose medium, and (S) stationary phase in LB medium. Immunoprecipitations were carried out with 5 μl of preimmune sera (lane 1) or 5 μl Hfq antisera (lane 2) and compared to total RNA from 1/10 extract equivalent used in the immunoprecipitations (lane 3). RNAs were fractionated on 10% polyacrylamide urea gels and analyzed by Northern hybridization using RNA probes to previously known sRNAs or our novel RNAs as indicated.
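
The probe-intensity calculation described for FIG. 2 (perfect match minus mismatch, with negative differences set to zero) can be illustrated with a minimal sketch. The function name and the sample intensities below are hypothetical and are not data from the experiments described here.

```python
def probe_signal(perfect_match, mismatch):
    """Per-probe signal: PM minus MM, with negative differences set to zero."""
    if len(perfect_match) != len(mismatch):
        raise ValueError("PM and MM lists must have the same length")
    return [max(pm - mm, 0.0) for pm, mm in zip(perfect_match, mismatch)]


# Hypothetical intensities for three interrogating probes (illustration only).
pm_values = [120.0, 45.0, 300.0]
mm_values = [100.0, 60.0, 50.0]
signal = probe_signal(pm_values, mm_values)
```

In this sketch the second probe has a mismatch intensity above its perfect match intensity, so its signal is reported as zero rather than as a negative value.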

BRIEF DESCRIPTION OF THE SEQUENCES

Candidate Number      SEQ ID NO
12                    1
14                    2
22                    3
24                    4
25                    5
26                    6
27                    7
31                    8
38                    9
40                    10
41-I                  11
41-II                 12
52-I                  13
52-II                 14
55-I                  15
55-II                 16
61                    17
8                     18
43                    19
9 (nucleotide)        20
9 (amino acid)        21
17 (nucleotide)       22
17 (amino acid)       23
28 (nucleotide)       24
28 (amino acid)       25
36 (nucleotide)       26
36 (amino acid)       27
49 (nucleotide)       28
49 (amino acid)       29
50 (nucleotide)       30
50 (amino acid)       31

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

By “RNA” or “gene product” or “transcription product” is meant the RNA encoded by the E. coli gene or RNA substantially homologous or complementary thereto or a derivative or fragment thereof having RNA activity. Encompassed by the definition of “RNA” are variants of RNA in which there have been trivial mutations such as substitutions, deletions, insertions or other modifications of the native RNA. The term “substantial homology” or “substantial identity”, when referring to polypeptides or polynucleotides, indicates that the sequence of a polypeptide or polynucleotide in question, when properly aligned, exhibits at least about 30% identity with the sequence of an entire naturally occurring polypeptide or polynucleotide or a portion thereof. Polynucleotides of the present invention which are homologous or substantially homologous to, for example, the polynucleotides of the invention usually have at least about 70% identity to that shown in the Sequence Listing, preferably at least about 90% identity and most preferably at least about 95% identity, or a complement thereof. Any technique known in the art can be used to sequence polynucleotides, including, for example, dideoxynucleotide sequencing (Sanger et al. 1977 PNAS USA 74:5463-5467), or using the Sequenase™ kit (United States Biochemical Corp.). Homologs of polynucleotides and polypeptides, whether synthetically or recombinantly produced or found in nature, are also encompassed by the scope of the invention, and are herein defined as polynucleotides and polypeptides which are homologous to, respectively, polynucleotides and polypeptides of the invention, or fragments, variants, or complements thereof. Homologous polynucleotides and polypeptides are generally encoded by homologous genes as described above, and retain significant amino acid residue or nucleotide identity to the genes of the invention. Such polypeptides can be expressed by other organisms such as bacteria, yeast and higher order organisms such as mammals. Various methods of determining amino acid residue or nucleotide identity are known in the art. Homologous polynucleotides or polypeptides can be obtained by in vitro synthesis, by expressing genes derived from other bacteria, or by mutagenizing genes of the invention. Also included in the definition of “substantially homologous polynucleotides” would be those polynucleotides which, when annealed under conditions known in the art, would remain annealed under moderate wash conditions also known in the art (such as washing in 6× SSPE twice at room temperature and then twice at 37° C.) (Wahl et al. 1987 Methods in Enzymology 152 Academic Press Inc., San Diego).

Polynucleotide and polypeptide homology is typically measured using sequence analysis software. See, e.g., Sequence Analysis Software Package of the Genetics Computer Group, University of Wisconsin Biotechnology Center, 1710 University Avenue, Madison, Wis. 53705.
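
As an illustration only, percent identity between two sequences that have already been aligned can be computed along the following lines. This is a minimal sketch, not the method of the software package cited above: it assumes the two inputs are pre-aligned strings of equal length with "-" marking gaps, and it takes identity as matching positions divided by aligned columns. Other conventions for the denominator exist and give different values. The function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two pre-aligned, equal-length sequences.

    Assumption: identity = identical non-gap columns / all columns that are
    not gap-vs-gap. Other definitions of the denominator are also in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # column carries no information
        columns += 1
        if a == b:
            matches += 1  # neither is a gap here, since both-gap was skipped
    return 100.0 * matches / columns if columns else 0.0


# Hypothetical aligned RNA fragments (illustration only).
identity = percent_identity("ACGU-ACGUA", "ACGUAACGAA")
# The text gives about 70%, 90% and 95% as successively preferred thresholds.
meets_70 = identity >= 70.0
```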

By “polynucleotide” or “nucleic acid” is meant a single- or double-stranded DNA, genomic DNA, cDNA, RNA, DNA-RNA hybrid, or a polymer comprising purine and pyrimidine bases, or other natural, chemically or biochemically modified or containing non-natural or derivatized nucleotide bases. The backbone of the polynucleotide can comprise sugars and phosphate groups (as typically found in RNA or DNA), or modified or substituted sugar or phosphate groups. Alternatively, the backbone of the polynucleotide can comprise a polymer of synthetic subunits such as phosphoramidates and is thus an oligodeoxynucleoside phosphoramidate (P—NH₂) or a mixed phosphoramidate-phosphodiester oligomer (Peyrottes et al. 1996 Nucleic Acids Res 24:1841-8; Chaturvedi et al. 1996 Nucleic Acids Res 24:2318-23; and Schultz et al. 1996 Nucleic Acids Res 24:2966-73). In another embodiment, a phosphorothioate linkage can be used in place of a phosphodiester linkage (Braun et al. 1988 J Immunol 141:2084-9; and Latimer et al. 1995 Mol Immunol 32:1057-1064). In addition, a double-stranded polynucleotide can be obtained from the single-stranded polynucleotide product of chemical synthesis either by synthesizing the complementary strand and annealing the strands under appropriate conditions, or by synthesizing the complementary strand de novo using a DNA polymerase with an appropriate primer.

A nucleic acid is said to “encode” an RNA or a polypeptide if, in its native state or when manipulated by methods known to those skilled in the art, it can be transcribed and/or translated to produce the RNA, the polypeptide or a fragment thereof. The anti-sense strand of such a nucleic acid is also said to encode the sequence. The polynucleotides of the present invention comprise those which are naturally-occurring, synthetic or recombinant.

A “recombinant” nucleic acid is one which is chemically synthesized or the product of the artificial manipulation of isolated segments of nucleic acids, e.g., by genetic engineering techniques. Isolated segments within a recombinant nucleic acid can be naturally occurring sequences.

The invention relates to new small RNAs and ORFs of E. coli as mediators of cell and intercell regulation. By “polynucleotide” or “gene” or “RNA” and the like is meant a polynucleotide encoding or comprising the RNA of the invention, or a homolog, fragment, derivative or complement thereof and having RNA activity as described herein. As is known in the art, a DNA can be transcribed by an RNA polymerase to produce RNA, and an RNA can be reverse transcribed by reverse transcriptase to produce a DNA. Thus a DNA can encode an RNA and vice versa.

The invention also encompasses vectors such as single- and double-stranded plasmids or viral vectors comprising RNA, DNA or a mixture or variant thereof, further comprising a polynucleotide of the invention. A wide variety of suitable expression systems are known in the art and are selected based on the host cells used, inducibility of expression desired and ease of use. The non-transcribed portions of a gene and the non-coding portions of a gene can be modified as known in the art. For example, the native promoters can be deleted, substituted or supplemented with other promoters known in the art; transcriptional enhancers, inducible promoters or other transcriptional control elements can be added, as can be replication origins and replication initiator proteins, autonomously replicating sequence (ARS), marker genes (e.g., antibiotic resistance markers), sequences for chromosomal integration (e.g., viral integration sites or sequences homologous to chromosomal sequences), restriction sites, multiple cloning sites, ribosome-binding sites, RNA splice sites, polyadenylation sites, transcriptional terminator sequences, mRNA stabilizing sequences, 5′ stem-loop to protect against degradation, and other elements commonly found on plasmids and other vectors known in the art. Secretion signals from secreted polypeptides can also be included to allow the polypeptide to cross and/or lodge in cell membranes or be secreted from the cell. Such vectors can be prepared by means of standard recombinant techniques discussed, for example, in Sambrook et al. 1989 Molecular Cloning: A Laboratory Manual, 2nd edition, Cold Spring Harbor Press, Cold Spring Harbor Laboratory, N.Y.; and Ausubel et al. (eds.), 1987 Current Protocols in Molecular Biology, Greene Publishing Associates, Brooklyn, N.Y. Many useful vectors are known in the art and can be obtained from vendors including, but not limited to, Stratagene, New England Biolabs, and Promega Biotech.

An appropriate promoter and other necessary vector sequences are selected so as to be functional in the chosen host. While prokaryotic host cells are preferred, mammalian or other eukaryotic host cells, including, but not limited to, yeast, filamentous fungi, plant, insect, amphibian or avian species, can also be useful for production of the polypeptides of the present invention. See, Kruse et al. (eds.) 1973 Tissue Culture Academic Press. Examples of workable combinations of cell lines and expression vectors are described in Sambrook et al. 1989 or Ausubel et al. 1987; see also, e.g., Metzger et al. 1988 Nature 334:31-36. Examples of commonly used mammalian host cell lines are VERO and HeLa cells, Chinese hamster ovary (CHO) cells, and WI38, BHK, and COS cell lines, or others as appropriate, e.g., to provide higher expression, desirable glycosylation patterns, etc.

By “bacterial host cell” or “bacteria” or “bacterium” is meant various micro-organism(s) containing at least one chromosome but lacking a discrete nuclear membrane. Representatives include E. coli, Bacillus, Salmonella, Pseudomonas, Staphylococcus and other eubacteria, archaebacteria, chlamydia and rickettsia and related organisms, and the like, and may be spherical, rod-like, straight, curved, spiral, filamentous or other shapes.

Vectors suitable for use with various cells can comprise promoters which can, when appropriate, include those naturally associated with genes of the invention. Promoters can be operably linked to a polynucleotide of the invention.

A nucleic acid sequence is “operably linked” when it is placed into a functional relationship with another nucleic acid sequence. For instance, a promoter is operably linked to a coding sequence if the promoter affects the transcription or expression of the gene. Generally, operably linked means that the DNA sequences being linked are contiguous and, where necessary to join two protein coding regions, contiguous and in reading frame.

Promoters can be inducible or repressible by factors which respond biochemically to changes in temperature, osmolarity, carbon source, sugars, etc., as is known in the art. Promoters including, but not limited to, the trp, lac and phage promoters, tRNA promoters and glycolytic enzyme promoters can be used in prokaryotic hosts. Useful yeast promoters include, but are not limited to, the promoter regions for metallothionein, 3-phosphoglycerate kinase or other glycolytic enzymes such as enolase or glyceraldehyde-3-phosphate dehydrogenase, and enzymes responsible for maltose and galactose utilization. Appropriate foreign mammalian promoters include, but are not limited to, the early and late promoters from SV40 (Fiers et al. 1978 Nature 273:113-120) and promoters derived from murine Moloney leukemia virus, mouse mammary tumor virus, avian sarcoma viruses, adenovirus II, bovine papilloma virus or polyoma. In addition, the construct can be joined to an amplifiable gene (e.g., DHFR) so that multiple copies of the construct can be made. For appropriate enhancer and other expression control sequences suitable for vectors, see also Enhancers and Eukaryotic Gene Expression, Cold Spring Harbor Press: N.Y. 1983.

While expression vectors are preferably autonomously replicating, they can also be inserted into the genome of the host cell by methods known in the art. Expression and cloning vectors preferably contain a selectable marker which is a gene encoding a protein necessary under at least one control for the survival or growth of a host cell transformed with the vector. The presence of this gene ensures the growth of only those host cells which express the inserts. Typical selection genes are known in the art and include, but are not limited to, those which encode proteins that (a) confer resistance to antibiotics or other toxic substances, e.g., ampicillin, neomycin, methotrexate, etc.; (b) complement auxotrophic deficiencies, or (c) supply critical nutrients not available from complex media, e.g., the gene encoding D-alanine racemase for Bacilli. The choice of the proper selectable marker depends on the host cell, as appropriate markers for different hosts are well known.

As one of skill in the art will understand, the choice in construction and arrangement of markers, promoters, origins of replication, etc. in various vectors of the present invention will be dictated by the desired level and timing of expression of RNA of the invention, with the ultimate goal of regulating the production of metabolic compounds in the host cell.

By “protein” or “polypeptide” is meant a polypeptide encoded by the E. coli gene of the invention or a polypeptide substantially homologous thereto and having protein activity. Encompassed by the proteins of the invention are variants thereof in which there have been trivial substitutions, deletions, insertions or other modifications of the native polypeptide which substantially retain protein characteristics, particularly silent or conservative substitutions. Silent nucleotide substitutions are changes of one or more nucleotides which do not change any amino acid of the protein. Conservative substitutions include substitutions within the following groups: glycine, alanine; valine, isoleucine, leucine; aspartic acid, glutamic acid; asparagine, glutamine; serine, threonine; lysine, arginine; and phenylalanine, tyrosine. Such conservative substitutions are not expected to interfere with biochemical activity, particularly when they occur in structural regions (e.g., alpha helices or beta pleated sheets) of the polypeptide, which can be predicted by standard computer analysis of the amino acid sequence of the protein. Also encompassed by the claimed polypeptides of the invention are polypeptides encoded by polynucleotides which are substantially homologous to a polynucleotide of the invention.
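
The conservative substitution groups listed above can be expressed as a simple lookup. The following is a minimal sketch using standard one-letter amino acid codes; the function name is hypothetical, and the check says nothing about whether a given substitution actually preserves activity.

```python
# Groups from the text, in one-letter code:
# Gly/Ala; Val/Ile/Leu; Asp/Glu; Asn/Gln; Ser/Thr; Lys/Arg; Phe/Tyr.
CONSERVATIVE_GROUPS = [
    set("GA"),
    set("VIL"),
    set("DE"),
    set("NQ"),
    set("ST"),
    set("KR"),
    set("FY"),
]


def is_conservative(original, replacement):
    """True if the two different residues fall within one listed group."""
    original, replacement = original.upper(), replacement.upper()
    if original == replacement:
        return False  # identical residues are not a substitution
    return any(
        original in group and replacement in group
        for group in CONSERVATIVE_GROUPS
    )
```

For example, under this sketch a lysine-to-arginine change is classed as conservative, whereas a glycine-to-tryptophan change is not.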

Nucleic acids encoding the polypeptides of the present invention include not only native or wild-type sequences but also any sequence capable of encoding the polypeptide, which can be synthesized by making use of the redundancy in the genetic code. Various codon substitutions can be introduced, e.g., silent or conservative changes as discussed above. Due to degeneracy in the genetic code there is some degree of flexibility in the third base of each codon and some amino acid residues are encoded by several different codons. Each possible codon could be used in the gene to encode the protein. While this may appear to present innumerable choices, in practice, each host has a particular preferred codon usage, so that genes can be tailored for optimal translation in the host in which they are expressed. Thus, synthetic genes that encode the proteins of the invention are included in this invention.
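
The degeneracy described above can be made concrete with a short sketch that counts how many distinct coding sequences specify a given peptide under the standard genetic code, and that back-translates a peptide using one chosen codon per residue. The function names are hypothetical, and the codon preferences passed in the example are placeholders chosen for illustration, not a statement of any host's actual codon usage.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def synonymous_codons(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [c for c, aa in CODON_TABLE.items() if aa == amino_acid.upper()]


def count_encodings(peptide):
    """Number of distinct DNA sequences that encode the peptide."""
    return prod(len(synonymous_codons(aa)) for aa in peptide)


def back_translate(peptide, preferred):
    """Build a coding sequence from a caller-supplied residue-to-codon map."""
    codons = []
    for aa in peptide.upper():
        codon = preferred[aa]
        if CODON_TABLE.get(codon) != aa:
            raise ValueError(f"{codon} does not encode {aa}")
        codons.append(codon)
    return "".join(codons)


# Placeholder preferences for illustration only (not measured codon usage).
example_preferences = {"M": "ATG", "K": "AAA", "L": "CTG"}
sequence = back_translate("MKL", example_preferences)
```

Even a three-residue peptide such as Met-Lys-Leu has twelve possible coding sequences in this sketch, which is why a host's preferred codon usage is a practical way to select among them.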

Techniques for nucleic acid manipulation are described generally, for example, in Sambrook et al. (1989) and Ausubel et al. (1987). Reagents useful in applying such techniques, such as restriction enzymes and the like, are widely known in the art and commercially available from vendors including, but not limited to, New England BioLabs, Boehringer Mannheim, Amersham, Promega Biotech, U.S. Biochemicals, New England Nuclear, and a number of other sources.

Nucleic acid probes and primers based on sequences of the invention can be prepared by standard techniques. Such a probe or primer comprises an isolated nucleic acid. In the case of probes, the nucleic acid further comprises a label (e.g., a radionuclide such as ³²P ATP or ³⁵S) or a reporter molecule (e.g., a ligand such as biotin or an enzyme such as horseradish peroxidase). The [³²P]-ATP, [³⁵S]-dATP and [³⁵S]-methionine can be purchased, for example, from DuPont NEN (Wilmington, Del.). Probes can be used to identify the presence of a hybridizing nucleic acid sequence, e.g., an RNA in a sample or a cDNA or genomic clone in a library. Primers can be used, for example, for amplification of nucleic acid sequences, e.g., by the polymerase chain reaction (PCR). See, e.g., Innis et al. (eds.) 1990 PCR Protocols: A Guide to Methods and Applications, Academic Press: San Diego. The preparation and use of probes and primers is described, e.g., in Sambrook et al. (1989) or Ausubel et al. (1987). The genes of homologs of RNA of the invention in other species can be obtained by generating cDNA from RNA from such species using any technique known in the art, such as using Riboclone cDNA Synthesis Systems AMV RT (Promega, Madison, Wis.), then probing such cDNA with radiolabeled primers containing various portions (e.g., 30 or 40 bases long) of the sequences disclosed herein. To obtain homologs of the proteins of the invention, degenerate primers can encode the amino acid sequence of the disclosed E. coli protein but differ in codon usage from the sequences disclosed.

Antisense and ribozyme nucleic acids capable of specifically binding to sequences of the invention are also useful for interfering with gene expression.

The nucleic acids of the present invention (whether sense or anti-sense, and whether encoding the genes of the invention, or a homolog, variant, fragment or complement thereof) can be produced in large amounts by replication of a suitable recombinant vector comprising DNA sequences in a compatible host cell. Alternatively, these nucleic acids can be chemically synthesized, e.g., by any method known in the art, including, but not limited to, the phosphoramidite method described by Beaucage et al. 1981 Tetra Letts 22:1859-1862, and the triester method according to Matteucci et al. 1981 J Am Chem Soc 103:3191, preferably using commercial automated synthesizers. The purification of nucleic acids produced by the methods of the present invention can be achieved by any method known in the art including, but not limited to, those described, e.g., in Sambrook et al. (1989), or Ausubel et al. (1987). Numerous commercial kits are available for DNA purification including Qiagen plasmid mini DNA cartridges (Chatsworth, Calif.).

The nucleic acids of the present invention can be introduced into host cells by any method known in the art, which vary depending on the type of cellular host, including, but not limited to, electroporation; transfection employing calcium chloride, rubidium chloride, calcium phosphate, DEAE-dextran, or other substances; microprojectile bombardment; P1 transduction; use of suicide vectors; lipofection; infection (where the vector is an infectious agent, such as a retroviral genome); and other methods. See generally, Sambrook et al. (1989), and Ausubel et al. (1987). The cells into which these nucleic acids have been introduced also include the progeny of such cells.

A polypeptide “fragment”, “portion”, or “segment” is a stretch of amino acid residues of at least about 7 to 19 amino acids (or the minimum size retaining an antigenic determinant). A fragment of the present invention can comprise a portion of at least 20 amino acids of the protein sequence, at least 30 amino acids of the protein sequence, at least 40 amino acids of the protein sequence, at least 50 amino acids of the protein sequence, or all or substantially all of the protein sequence. In addition, the invention encompasses polypeptides which comprise a portion of the sequence of the lengths described in this paragraph, which further comprise additional amino acid sequences on the ends or in the middle of sequences. The additional amino acid sequences can, for example, comprise another protein or a functional domain thereof, such as signal peptides, membrane-binding moieties, etc.

A polynucleotide fragment of the present invention can comprise a polymer of at least six bases or basepairs. A fragment of the present invention can comprise at least six bases or basepairs, at least ten bases or basepairs, at least twenty bases or basepairs, at least forty bases or basepairs, at least fifty bases or basepairs, at least one hundred bases or basepairs, at least one hundred fifty bases or basepairs, at least two hundred bases or basepairs, at least two hundred fifty bases or basepairs, or at least three hundred bases or basepairs of the gene sequence. In addition, the invention encompasses polynucleotides which comprise a portion of the sequence of the lengths described in this paragraph, which further comprise additional nucleic acid sequences on the 5′ or 3′ end or inserted into the sequence. These additional sequences can, for example, encode a coding region of a gene or a functional domain thereof or a promoter.

The terms “isolated”, “pure”, “substantially pure”, and “substantially homogenous” are used interchangeably to describe a polypeptide, or polynucleotide which has been separated from components which naturally accompany it. A monomeric protein or a polynucleotide is substantially pure when at least about 60 to 75% of a sample exhibits a single polypeptide or polynucleotide sequence. A substantially pure protein or polynucleotide typically comprises about 60 to 90% by weight of a protein or polynucleotide sample, more usually about 95%, and preferably will be over about 99% pure.

Protein or polynucleotide purity or homogeneity may be indicated by a number of means, such as polyacrylamide gel electrophoresis of a sample, followed by visualizing a single band upon staining the gel. For certain purposes higher resolution can be provided by using high performance liquid chromatography (HPLC) or other means well known in the art for purification.

An RNA or a protein is “isolated” when it is substantially separated from the contaminants which accompany it in its natural state. Thus, a polypeptide which is chemically synthesized or expressed as a recombinant protein, i.e., an expression product of an isolated and manipulated genetic sequence, is considered isolated. A recombinant polypeptide is considered “isolated” even if expressed in a homologous cell type.

A polypeptide can be purified from cells in which it is produced by any of the purification methods known in the art. For example, such polypeptides can be purified by immunoaffinity chromatography employing, e.g., the antibodies provided by the present invention. Various methods of protein purification include, but are not limited to, those described in Guide to Protein Purification, ed. Deutscher, vol. 182 of Methods in Enzymology, Academic Press, Inc., San Diego, 1990 and Scopes, 1982 Protein Purification: Principles and Practice, Springer-Verlag, New York.

Polypeptide fragments of the protein of the invention are first obtained by digestion with enzymes such as trypsin, clostripain, or Staphylococcus protease, or with chemical agents such as cyanogen bromide, O-iodosobenzoate, hydroxylamine or 2-nitro-5-thiocyanobenzoate. Peptide fragments can be separated by reversed-phase HPLC and analyzed by gas-phase sequencing. Peptide fragments are used in order to determine the partial amino acid sequence of a polypeptide by methods known in the art including, but not limited to, Edman degradation.

The present invention also provides polyclonal and/or monoclonal antibodies capable of specifically binding to a polypeptide of the invention, or homolog, fragment, complement or derivative thereof. Antibodies can also be produced which bind specifically to a polynucleotide of the invention, such as an RNA of the invention or homolog, fragment, complement or derivative thereof, and may be produced as described in, for example, Thiry 1994 Chromosoma 103:268-76; Thiry 1993 Eur J Cell Biol 62:259-69; Reines 1991 J Biol Chem 266:10510-7; Putterman et al. 1996 J Clin Invest 97:2251-9; and Fournie 1996 Clin Exp Immunol 104:236-40. Antibodies capable of binding to polypeptides or polynucleotides of the invention can be useful for detecting protein, titrating protein, quantifying protein, purifying protein or polynucleotide, or for other uses.

For production of polyclonal antibodies, an appropriate host animal is selected, typically a mouse or rabbit. The substantially purified antigen, whether the whole polypeptide, a fragment, derivative, or homolog thereof, or a polypeptide coupled or fused to another polypeptide, or polynucleotide or homolog, derivative, complement or fragment thereof, is presented to the immune system of the host by methods appropriate for the host, commonly by injection into the footpads, intramuscularly, intraperitoneally, or intradermally. Peptide fragments suitable for raising antibodies can be prepared by chemical synthesis, and are commonly coupled to a carrier molecule (e.g., keyhole limpet hemocyanin) and injected into a host over a period of time suitable for the production of antibodies. The sera are tested for immunoreactivity to the protein or fragment. Monoclonal antibodies can be made by injecting the host with the protein polypeptides, fusion proteins or fragments thereof and following methods known in the art for production of such antibodies (Harlow et al. 1988 Antibodies: A Laboratory Manual, Cold Spring Harbor Laboratories).

An immunological response is usually assayed with an immunoassay, a variety of which are provided, e.g., in Harlow et al. 1988 or Goding 1986 (Monoclonal Antibodies: Principles and Practice, 2d ed., Academic Press, New York), although any method known in the art can be used.

Monoclonal antibodies with affinities of 10⁸ M⁻¹, preferably 10⁹ to 10¹⁰ M⁻¹, or stronger are made by standard procedures as described, e.g., in Harlow et al. 1988, or Goding 1986. Briefly, appropriate animals are immunized with the antigen by a standard protocol. After the appropriate period of time, the spleens of such animals are excised and individual spleen cells fused to immortalized myeloma cells. Thereafter the cells are clonally separated and the supernatants of each clone are tested for their production of an appropriate antibody specific for the desired region of the antigen.

Other suitable techniques of antibody production include, but are not limited to, in vitro exposure of lymphocytes to the antigenic polypeptides or selection of libraries of antibodies in phage or similar vectors (Huse et al. 1989 Science 246:1275-1281).

Frequently, the polypeptides and antibodies are labeled, either covalently or noncovalently, with a substance which provides for a detectable signal. A wide variety of labels and conjugation techniques are known. Suitable labels include, but are not limited to, radionuclides, enzymes, substrates, cofactors, inhibitors, fluorescent agents, chemiluminescent agents, and magnetic particles. Also, recombinant immunoglobulins can be produced by any method known in the art.

Identification of Novel Small RNAs Using Comparative Genomics and Microarrays

As a starting point for detecting novel sRNAs in E. coli, we considered a number of common properties of the previously identified sRNAs that might serve as a guide to identify genes encoding new sRNAs. We define sRNAs as relatively short RNAs that do not function by encoding a complete ORF. Considering the 13 small RNAs known when this work began, we were struck by the high conservation of these genes between closely related organisms. In most cases, the conservation between E. coli and Salmonella was above 85%, whereas that of the typical gene encoding an ORF was frequently below 70%. Conservation tests on random noncoding regions of the genome suggested that extended conservation in intergenic regions was unusual enough to be used as an initial parameter to screen for new sRNA genes. We therefore tested this approach to look for novel sRNAs in the E. coli genome.

All known sRNAs are encoded within intergenic (Ig) regions (defined as regions between ORFs). A file containing all Ig sequences from the E. coli genome (Blattner, F. R. et al. 1997 Science 277:1453-1474) was used as a starting point for our homology search. We arbitrarily chose the 1.0- to 2.5-Mb region of the 4.6-Mb E. coli genome to test and refine our approach and developed the following steps for searching the full E. coli genome.

All Ig regions of 180 nucleotides (nt) or larger were compared to the NCBI Unfinished Microbial Genomes database using the BLAST program (Altschul, S. F. et al. 1990 J Mol Biol 215:403-410). These 1097 Ig regions were rated based on the degree of conservation and length of the conserved region when compared to the closely related Salmonella and Klebsiella pneumoniae species. The highest rating was given to Ig regions with a high degree of conservation (raw BLAST score of >80) over at least 80 nt (see below for explanation of ratings). Note that most promoters do not meet these length and conservation requirements. FIG. 1 shows a set of BLAST searches for three known sRNAs (RprA RNA, CsrB RNA, OxyS RNA), three Ig regions with high conservation (#14, #17, #52) and one Ig region with intermediate conservation (#36). Some Ig regions had a large number of matches, often to several chromosomal regions of the same organism. These Ig regions were noted and many were found to contain tRNAs, rRNAs, REP, or other repeated sequences. The 40 highly conserved Ig regions containing tRNAs and/or rRNAs were eliminated from our search, as these regions were complicated in their patterns of conservation.
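
As a rough illustration of the top rating described above, the following Python sketch applies the stated thresholds to a set of BLAST matches. It is a minimal sketch, not the procedure actually used: the `BlastHit` record, its fields, and both function names are hypothetical, and only the three thresholds (Ig length of at least 180 nt, raw BLAST score above 80, conservation over at least 80 nt) are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable

MIN_IG_LENGTH = 180      # nt; shorter Ig regions were not searched
MIN_RAW_SCORE = 80       # raw BLAST score must exceed this value
MIN_CONSERVED_LEN = 80   # nt; minimum length of the conserved stretch


@dataclass
class BlastHit:
    """Hypothetical record for one BLAST match to an Ig region."""
    organism: str
    raw_score: float
    aligned_length: int  # nt


def is_searchable(ig_length: int) -> bool:
    """True if the Ig region is long enough to enter the homology search."""
    return ig_length >= MIN_IG_LENGTH


def is_highly_conserved(hits: Iterable[BlastHit]) -> bool:
    """True if any single hit meets both the score and the length threshold."""
    return any(
        h.raw_score > MIN_RAW_SCORE and h.aligned_length >= MIN_CONSERVED_LEN
        for h in hits
    )
```

The intermediate and low ratings are not modeled here, because their exact cutoffs are not given in this passage.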

Next, the orientation and identity of the ORFs bordering the Ig regions were determined using the Colibri database, an annotated listing of all E. coli genes and their coordinates. Inconsistencies between the Colibri database and our original file led to the reclassification of some Ig regions as shorter than 180 nt, and these were not analyzed further. Of the remaining 1006 Ig regions, 13 contained known small RNAs, 295 were in the highest conservation group, 88 showed intermediate conservation, and 610 showed no conservation.

The location of the conservation relative to the orientation of the flanking ORFs was an important consideration in choosing candidates for further analysis. In many cases (132/295 Ig regions), the conserved region was just upstream of the start of an ORF, consistent with conservation of regulatory regions, including untranslated leaders. Cases where the conserved region was >50 nt from an ORF start or extended over more than 150 nt in length (RprA RNA, CsrB RNA, OxyS RNA, #17, and #52 in FIG. 1), or where the bordering ORFs ended rather than started at the Ig region (#14 in FIG. 1), were considered better candidates for novel sRNAs.

Published information on promoters and other known regulatory sites within conserved regions of promising candidates was tabulated and used to eliminate many candidates in which the conservation could be attributed to previously identified promoters or 5′ untranslated leaders. Finally, the remaining candidate regions were examined for sequence elements such as potential promoters, terminators, and inverted repeat regions. We considered evidence for possible stem-loops, in particular those with characteristics of rho-independent terminators, as especially indicative of possible sRNA genes (Table 1).

Using these criteria, together with microarray expression data (see below), a set of 59 candidates was selected (Table 1). Candidates 1-18 were chosen in the first round of screening of the 1.0- to 2.5-Mb region; some of these candidates would not have met the higher criteria applied to the rest of the genome.

TABLE 1 sRNA Candidates Ig Ig Flanking Selection Microarray NorthernInterpretation of No.^(a) Start Length Genes Strand^(b) Criteria^(c)Detection^(d) Detection^(e) Conservation^(f) 1 1019277 359 ompA/sulA < <C (4), S < large known ompA leader 2 1102420 754 csgD/csgB < > C (4), Lnone faint large known csgD leader, promoter 3 1150625 213 fabG/acpP > >C* (4), S > multiple, 300 + nt known acpP mRNA & operon 4 1194145 201ymfC/icd < > C* (0), S > large leader 5 1297345 476 adhE/ychE < > C (4),L none large known adhE leader 6 1298466 740 yhcE/oppA > > C (2), L, S >large + faint others leader, promoter? 7 1328693 376 yciN/topA < > C (4)none large known leader, promoter 8 1407055 480 ydaN/dbpA > > C (4), Lnone none predict sRNA 9 1515024 314 ydcW/ydcX > > C (4), L, S <, > 180nt (<) mRNA, 31 aa ORF 10 1671526 411 ydgF/ydgG < > C (4), L, T nonenone promoter/leader? 11 1755132 313 pykF/lpp > > C (4) > (rif) 300 ntknown lpp mRNA 12 1762411 550 ydiC/ydiH < < C (4), T none 60 nt (<) sRNA13 1860454 341 yeaA/gapA < > C (4), S > (rif) large known gapA leader,promoter? 14 2165049 278 yegQ/orgK > < C (4), L, S > 86 nt (>) sRNA 152276258 335 yejG/bcr < < C (4), L, S < large leader 16 2403093 633nuoA/lrhA < < C (4), L, S < large + 300 nt known processed leader 172588726 540 acrD/yffB > > C (4), S, I < 175, 266 nt (<) mRNA, 19 aa ORF18 1339749 196 yciM/pyrF > > C (3), S* none none promoter/leader? 19450835 462 cyoA/ampG < < C (4), S* > faint large promoter/leader? 
20753692 708 gltA/sdhC < > C* (4), S < (rif) faint large known gltA, sdhCleaders 21 986206 605 ompF/asnS < < (4), S*, I, P, T < (rif), > largeknown ompF leader, promoter 22 2651357 823 sseA/sseB > < C (4), L, S, I,T > (rif) 320 nt (>) sRNA 24 3348110 223 elbB/arcB < < C* (4), S* <, >45 nt (>) sRNA 25 3578437 332 yhhX/yhhY < > C (4), L, P, T none 90 nt(<) sRNA 26 3983621 681 aslA/hemY < < C (4), T > 210 nt (>) sRNA 274275510 548 soxR/yjcD > > C (4), L, S*, T < 140 nt (<) sRNA 28 4609568412 osmY/yjjU > > C (4), L, S* <, > (rif) 350 nt (>) mRNA, 53 aa ORF 29454011 346 bolA/tig > > C* (4), S, I > (rif). large leader or operon 30668152 370 ybeB/cobC < < C (4), L, S*, I, P <, > (rif) large (>)leader/promoter? 31 887180 180 ybjK/ybjL > < C (4), L none 80 nt (<)sRNA 32 2590752 343 dapE/ypfH > < C (0), L, S <, > none 66 aa ORF 332967000 684 ygdP/mutH < > C* (4) none none promoter/leader? 34 3672003413 yhjD/yhjE > > C* (4) none none promoter/leader? 35 3719676 284yiaZ/glyS < < C (4), L, P none large leader/promoter? 36 3773784 508mtlR/yibL > > (2), S* < (rif) 500 nt (<) mRNA, 69 aa ORF 37 4638109 402yjjY/lasT > > C (4), L, P, F > none/faint known arcA leader 38 4048313614 yihA/yihI < > C* (4), S, T > (rif) 270 nt (>) sRNA 39 279100 512afaB/yagB < < C (4), L, S* <, > faint large IS30, leader/promoter? 40852161 245 b0816/ybiQ < > C (4), L, P none 205 nt (<) sRNA 41 2974037584 aas/galR < > C (4), L, S, T <, > 89, 83 nt (<) sRNA 42 2781229 432pinH/ypjB < < C (1), L, T none none not conserved 43 3192539 424yqiK/rfaE > < C*(4), L, S <, > none predicted sRNA 44 3245066 347exuR/yqjA > > C (4), L > none promoter/leader? 45 3376287 221 rplM/yhcM< < C* (4), (S), T < (rif) large leader 46 2531398 386 cysK/ptsH > > C(4), S*, T <, > (rif) large known ptsH leader 47 4403561 207purA/yjeB > > C (4), S*, I > large leader/promoter? 
48 1239170 391dadX/ycgO > < C (4), L none none IS end 49 1306670 373 cls/kch < < C*(4) none 250 nt (>) mRNA, 57 aa ORF 50 1620541 446 ydeE/ydeH > < C (4),L, I > 185, 220 nt (>) mRNA, 31 aa ORF 51 1903281 377 yobD/yebN > > C(4), L none none promoter/leader? 52 1920997 395 pphA/yebY < < C (4), L,S* <, > 275 nt (>), 100 nt (<) sRNA 53 1932629 237 edd/zwf < < C (4) <none promoter/leader? 54 2085091 263 yeeF/yeeY < < C (4), T < largeleader 55 2151151 740 yegL/yegM < > C (4), L > 143 nt + others (>) sRNA56 2494583 497 ddg/yfdZ > < C (4), L < none known ORF 57 3717395 283yiaG/cspA > > C*(4), S* <, > large known cspA leader 58 4177159 415rplA/rplJ > > C (4), S* <, > large known operon 59 1668974 396ynfM/asr > > (2), S* < none promoter/leader? 60 2033263 591yedS/yedU > > (1), S* <, > none not conserved 61 3054807 394 ygfA/serA >< (1), D <, > 139 nt (>) sRNA ^(a)Candidate numbers. #23 was notanalyzed; the region of conservation corresponds to a published leadersequence. Candidate #61 was added because it is homologous to candidate#43 and the duplicated regions within #55 (see Text and Table 2).^(b)Orientation of flanking genes. > and < denote genes present on theclockwise (Watson) or counterclockwise (Crick) strand of the E. colichromosome, respectively. ^(c)Criteria used for selection of candidates:C, conservation; C*, long conservation; (#), conservation score. Igregions were assigned scores on the basis of BLAST searches (see textbelow). #4 and #32 were rerated from 4 (conserved) to 0 on reanalysis ofthe endpoints of the flanking ORF (#4) and information on an ORF withinthe Ig region (#32). L, location of conservation either far from 5′ endof flanking gene or near 3′ end of gene; S, signal detected inmicroarray experiments; S*, microarray signal on opposite strand toflanking genes; I, inverted repeat; P, predicted promoter; T, predictedterminator; D, duplicated gene. ^(d)Detection on high-densityoligonucleotide probe arrays. > <, orientation of signal as in b. 
Rif, signals present after 20 min treatment with rifampicin. ^(e)Northern analysis of RNA extracted from MG1655 cells grown in three conditions (LB medium, exponential phase; minimal medium, exponential phase; LB medium, stationary phase). Strand specific probes were used for sRNA and mRNAs encoding novel ORFs (orientation noted < or > as in b); double stranded DNA probes were used for the rest. For #43, bands were originally detected with a double stranded probe, but appear to be from homologs (see text). Large, >400 nt. ^(f)Interpretation of high conservation was based on microarray and Northern analyses as well as literature. mRNAs, small RNA transcripts predicted to encode new polypeptides (see text); “known leaders”, literature references supported the existence of leaders corresponding to conservation. For #37, conservation is consistent with the leader of the arcA gene (Compan, I. and Touati, D. 1994 Mol Microbiol 11: 955-964). The ORF noted for #56 is described in Seoane, A. S. and Levy, S. B. 1995 J Bacteriol 177: 530-535; and Bouvier, J. et al. 1992 J Bacteriol 174: 5265-5271; see GenBank entry BAA16347.1. The IS sequence fragment in the conserved region of #48 is homologous to that described by McVeigh, A. et al. 2000 Infect Immun 68: 5710-5715. “leaders”, a large band on Northern analysis, coupled with conservation near the 5′ end of an ORF; “promoter/leader?”, absence of RNA signal, coupled with conservation near the 5′ end of a gene; “leader/promoter?”, RNA signal from microarray or Northern analyses suggested a leader, while the conservation is far from the expected position of a leader; “leader or operon”, (for #29) microarray analysis suggested a continuous transcript throughout Ig; “predicted sRNAs”, (for #8 and #43) Igs contain the hallmarks expected for an sRNA, but RNA transcripts were not detected. Igs encoding sRNAs also may include leaders; this is not included in the conclusion column.

Selecting Candidate Genes by Whole Genome Expression Analysis

In an independent series of experiments, high-density oligonucleotide probe arrays were used to detect transcripts that might correspond to sRNAs from Ig regions. Total RNA isolated from MG1655 cells grown to late exponential phase in LB medium was labeled for probes or used to generate cDNA probes (see text below). From a single RNA isolation, each labeling approach was carried out in duplicate and individually hybridized to high-density oligonucleotide microarrays. The high-density oligonucleotide probe arrays used are appropriate for this analysis as they have probes specific for both the clockwise (Watson) and counterclockwise (Crick) strands of each Ig region as well as for the sense strand of each ORF. The resulting data from the four experiments were analyzed to examine global expression within Ig regions, as well as neighboring ORFs.

Our criteria for analyzing the microarray data evolved during the course of this analysis. Stringent criteria (longer transcripts in the Ig region, higher expression levels) identified many of the previously known sRNAs but did not uncover many strong candidates for new small RNAs. More relaxed criteria (shorter transcripts, lower expression levels) gave a very large number of candidates and therefore were not by themselves useful as the initial basis for identifying candidates. However, these data were very useful as an additional criterion for selection of candidate regions based on the conservation approach. Detection of a transcript by microarray on the strand opposite to that of surrounding ORFs was considered a strong indicator of an sRNA (S* in Table 1). Microarray data contributed to the selection of 34 of 59 candidates (Table 1). Examples of the different types of expression observed in microarray experiments are shown in FIG. 2. Signal corresponding to CsrB RNA clearly is detected on the Crick (C) strand. #17 and #36 have a transcript in the Ig region on the opposing strand (C) to that for the flanking genes (Watson; W). However, the expression patterns were not as obvious in many cases, either because expression levels were low or because the pattern of expression could be interpreted in a number of ways. For instance, very little expression was detected for RprA RNA encoded on the W strand, and there is unexplained signal detected from the opposite strand of the rprA and csrB Ig regions. #14 and #52 also had some expression on each strand (FIG. 2). #14 proved to express a small RNA from the Watson strand, while #52 expresses sRNAs from each strand (see below and Table 2).

Given that a number of the known sRNAs are relatively stable, we tested whether selection for stable RNAs might allow the microarray data to be more useful for de novo identification of sRNA candidates. The transcription inhibitor rifampicin was added to cells for 20 min prior to harvesting the RNA with the intention of enriching for stable RNAs. Many of the known sRNAs can be detected after the rifampicin treatment. Of the 59 candidates in Table 1, twelve retained a hybridization signal (marked rif in Table 1), and four of these proved to correspond to small transcripts (see below). Other rif resistant transcripts detected in Ig regions appeared to be due to highly expressed leaders.

TABLE 2 Novel sRNAs and Predicted Small ORFs^(a) Effect on RNA HfqrpoS-lacZ^(h) No. Gene Minute Size^(b,c,d) Strand^(e) Expression^(f)Binding^(g) S M Other Information^(j) 12 rydB 38  60^(b) < < < M >> S >E NT 0.4 1.0 14 ryeE 47  86^(b) > > < E, S > M + (E) 0.25 1.2 borderedby cryptic prophage 22 ryfA 57 320^(c) > > < E, M NT NT NT PAIR3 (Rudd,K. E. 1999 Res Microbiol 150: 653-664) 24 ryhA 72  45^(b) < > < S >> M >E + (S) 1.0 1.9 105, 120 nt, present S >> M > E 105 nt binds Hfq (+, S)25 ryhB 77  90^(b) < < > M >> S + (M) 1.2 0.4 multicopy plasmidrestricts growth on succinate 26 ryiA 86 210^(b) < > < E > M, S + (E)0.9 1.5 155 nt, present M > E, S 27 ryjA 92 140^(b) > < > S >> M − (S)NT NT 31 rybB 19  80^(b) > < < S >> M + (S) 1.0 2.3 38 ryiB 87 270^(b)< > > M > S >> E − (M) 1.0 1.6 CsrC (Romeo, pers. commun.) 40 rybA 18205^(b) > < > S > M > E − (S) 1.2 1.5 ladder up from 255, 300 nt,present S > M > E   41-I rygA 64  89^(b) < < > S >> M, E + (S) 1.3^(i)1.7^(i) PAIR2 (Rudd, K. E. 1999 Res Microbiol 150: 653-664)  41-II rygB64  83^(b) < < > S, E > M + (S) 1.3^(i) 1.7^(i) PAIR2 (Rudd, K. E. 1999Res Microbiol 150: 653-664)   52-I ryeA 41 275^(b) < > < M > E > S −/+(M) 1.1^(i) 1.0^(i) 148, 152, 180 nt (+ others), present M, S  52-IIryeB 41 100^(b) < < < S >> M + (S) 1.1^(i) 1.0^(i) 70 nt, present S >> M  55-I ryeC 46 143^(c) < > > S > M > E NT 1.2 1.6 QUAD1a (Rudd, K. E.1999 Res 107^(c) M > E, S Microbiol 150: 653-664)  55-II ryeD 46 137^(c)< > > M > E > S NT NT NT QUAD1b (Rudd, K. E. 1999 Res 102^(c) M > EMicrobiol 150: 653-664) 61 rygC 65 139^(c) > > < S >> M > E NT NT NTQUAD1c (Rudd, K. E. 1999 Res 107^(c) S, M > E Microbiol 150: 653-664)  8rydA 30 139^(d) > (>) > none NT NT NT Expression not detected; predictedsRNA 43 rygD 69 143^(d) > (<) < none NT NT NT QUAD1d (Rudd, K. E. 
1999Res Microbiol 150: 653-664) Expression not detected  9 yncL 32 180^(b) >< > S > M > E +/− (S) NT NT 31 aa ORF 17 ypfM 55 266^(b) > < > E >> M−/+ (E) 2.0 1.5 19 aa ORF 175 nt, present E, M 28 ytjA 99 305^(b) > > >S > M NT NT NT 53 aa ORF 36 yibT 81 500^(b) > < > S >> E, M NT 1.3 1.069 aa ORF 49 yciY 28 250^(b) < > < E, M NT NT NT 57 aa ORF 50 yneM 35185^(b) > > < S NT NT NT 31 aa ORF 220^(b) M > E ^(a)Table is dividedinto three sections: detected sRNAs, predicted sRNAs and detected RNAspredicted to encode small ORFs. ^(b,c,d)RNA sizes estimated fromNorthern analyses using ^(b)single stranded RNA probes or^(c)oligonucleotide probes, or ^(d)from predictions resulting fromsequence analysis (see text). ^(e)> < denotes orientation of sRNA andflanking genes as in Table 1. ^(f)Relative expression in three growthconditions: E, LB medium, exponential phase; M, minimal medium,exponential phase; and S, LB medium, stationary phase. ^(g)RNAcoimmunoprecipitation with Hfq as detected by Northern analysis: +,strong binding (>30% of RNA bound); +/−, weak binding (5-10%); −/+,minimal binding (<5%), and −, no detectable binding. E, M, S refer tocell growth conditions as in f. NT, not tested. ^(h)Expression ofrpoS-lacZ fusion in the presence of multicopy plasmids carryingintergenic regions. Activity was measured in stationary phase in LBmedium (S) or minimal medium (M) and normalized to the activity of thevector control in the same experiment. In parallel experiments, cellscarrying the vector alone gave 1.3-2 (S) and 0.7-2.6 (M) units, cellscarrying pRS-DsrA plasmid gave a 4.9 fold increase (S) and 12 foldincrease (M); cells carrying pRS-RprA plasmid gave 3.1 fold (S) and 3.3fold (M) increase. Results in table are average of at least threeindependent assays. Values in bold were considered significantlydifferent from the control. NT, not tested. ^(i)#41 and #52 each expresstwo sRNAs so it is not possible to assign a phenotype to a given smallRNA. 
Thus far there is no evidence for a strong phenotype for either candidate. ^(j)Included is information about additional RNA bands detected in Northern analysis.

Small RNA Transcripts Detected by Northern Hybridization

The final test for the presence of an sRNA gene was the direct detection of a small RNA transcript. The candidates in Table 1 were analyzed by Northern hybridization using RNA extracted from MG1655 cells harvested from three growth conditions (exponential phase in LB medium, exponential phase in M63-glucose medium, or stationary phase in LB medium). The microarray analysis discussed above used RNA isolated from cells grown to late exponential phase in LB medium, which is intermediate between the two LB growth conditions used for the Northern analysis. Initially, Northern analysis was carried out using double-stranded DNA probes containing the full Ig region for most candidates. In three cases (#8, #22, and #55) PCR amplification of the Ig region to generate a probe was not successful and therefore oligonucleotide probes were used for Northern analysis. Seventeen candidates gave distinct bands consistent with small RNAs, and one additional candidate gave a somewhat larger RNA, but the location of conservation was not consistent with a leader sequence for a flanking ORF (#36). In some of these cases, two or more RNA species were detected with a single Ig probe (Table 2, see also FIG. 3). One candidate (#43) gave a signal with the double stranded DNA probe, but contains regions duplicated elsewhere in E. coli that probably account for this signal (see below). Of the remaining 41 candidates, 17 gave no detectable transcript. These Ig regions could encode sRNAs expressed only under very specific growth conditions. For instance, #8 has all the sequence hallmarks of an sRNA gene (a well-conserved region preceded by a possible promoter and ending with a terminator), but has not been detected. Alternatively, the observed conservation could be due to nontranscribed regulatory regions. Fairly large RNAs were detected for another 24 candidates. Given the size of these transcripts together with data on the orientation of flanking genes and the location of conserved regions, it is likely these are leader sequences within mRNAs (Table 1).

For candidates expressing RNAs not expected to be 5′ untranslated leaders, Northern analysis was carried out with strand-specific probes to determine gene orientation (FIG. 3). For many of the candidates, we used sequence elements (see below) as well as expression information from the microarray experiments to predict which strand was most likely expressed; both strands were tested when predictions were unclear. The results from the strand-specific probes generally agreed with predictions and were used to estimate the RNA size (Table 2). Interestingly, in one case there is an sRNA expressed from both the W and C strand within the Ig (#52; FIG. 3). For #12, although no sRNA had been detected using a double stranded DNA probe, the presence of a potential terminator and promoter remained suggestive of the presence of an sRNA gene. Therefore, oligonucleotide probes also were used in Northern analysis of this candidate, and a small RNA transcript was detected (FIG. 3; Table 1).

Examination of expression profiles of the RNAs under different growth conditions gave an indication of specificity of expression. Some candidates were detected under all three growth conditions; others were preferentially expressed under one growth condition (FIG. 3; Table 2). For instance, #25 was present primarily during growth in minimal medium, consistent with the absence of detection in the whole genome expression experiment, which analyzed RNA isolated from cells grown in rich medium.

Sequence Predictions of sRNA Genes and ORFs

For the candidates expressing small RNA transcripts, the conserved sequence blocks (contigs) from K. pneumoniae, the highest conserved Salmonella species, and in a few cases Yersinia pestis, were selected from the NCBI Unfinished Microbial Genome database and aligned with the E. coli Ig region using GCG Gap (Devereux, J. et al. 1984 Nucleic Acids Res 12:387-395). Multiple alignments were assembled by hand, and the conserved regions were examined for likely promoters and terminators and other conserved structures. Information from the alignments, together with results from strand-specific Northern and microarray expression analyses, allowed assignments of gene orientation, putative regulatory regions, and RNA length from the predicted starting and ending positions. Where a terminator sequence was very apparent (13 of 19 candidates), transcription was assumed to end at the terminator, and the observed size of the transcript was used to help identify possible promoters. The identification of promoters and terminators was less definite when there was only one species with conservation to E. coli.

As the alignments were assembled, the pattern of conservation in some cases was reminiscent of patterns expected from ORFs, with higher sequence variation in positions consistent with the third nucleotide of codons. GCG Map (Devereux, J. et al. 1984 Nucleic Acids Res 12:387-395) was used to predict translation in all frames for all of the candidate small RNAs. In six cases, the conservation and translation potential suggested the presence of a short ORF. In these cases, a ribosome-binding site and the potential ORF were well conserved, with the most variation in the third position of codons, but other elements of the predicted RNA were less well conserved. For example, #17 expresses an RNA of about 266 nt, containing a predicted ORF of only 19 amino acids. Within the predicted Shine-Dalgarno sequence and ORF, only 9/80 positions showed variation for either Klebsiella or Salmonella, while the overall RNA is less than 60% conserved. We predict that for #17, as well as five others (Table 2), the detected RNA transcript is functioning as an mRNA, encoding a short, conserved ORF. An evaluation of both the new predicted ORFs and the untranslated sRNAs with GLIMMER, a program designed to predict ORFs within genomes, gave complete agreement with our designations (Delcher, A. L. et al. 1999 Nucleic Acids Res 27:4636-4641).
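
The ORF-like pattern described above, in which variation is concentrated at the third position of codons, can be illustrated by tallying mismatches by codon position. The sketch below is hypothetical and simplified: it assumes two ungapped sequences of equal length that are already aligned and begin at the first base of a codon, and the function name and the toy sequences are invented for illustration. It is not the GCG Gap/Map analysis used in this work.

```python
def mismatches_by_codon_position(ref: str, other: str) -> list:
    """Count mismatches at codon positions 1, 2 and 3.

    Assumes both sequences are in frame, ungapped and of equal length.
    Returns [n_first, n_second, n_third].
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for index, (a, b) in enumerate(zip(ref.upper(), other.upper())):
        if a != b:
            counts[index % 3] += 1  # index % 3 == 2 is the third codon position
    return counts


# Toy sequences (invented): every difference falls on a third position.
toy_ref = "ATGAAACTGGCC"
toy_other = "ATGAAGCTTGCA"
print(mismatches_by_codon_position(toy_ref, toy_other))  # [0, 0, 3]
```

A region whose mismatches are spread evenly over the three positions would be less suggestive of a conserved ORF than one where the third position dominates.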

We have assigned gene names to all candidates that we have confirmed are expressed as RNAs (see Table 2). The genes we predict to encode ORFs were given names according to accepted practice for ORFs (Rudd, K. E. 1998 Microbiol Mol Biol Rev 62:985-1019). The genes that express sRNAs without evidence of conserved ORFs were named with a similar nomenclature: ryx, with ry denoting RNA and x indicating the 10 min interval on the E. coli genetic map.
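
The naming rule can be sketched as a small Python function. This is an illustration under stated assumptions rather than a definitive rule: the function name is hypothetical, and the mapping of the letters a through j to the intervals 0-10 through 90-100 min is inferred from the gene names and map positions listed in Table 2 (for example, rydB at 38 min and ryjA at 92 min). The final capital letter of each name is not derived here.

```python
def srna_gene_prefix(minute: float) -> str:
    """Return the three-letter sRNA prefix for a map position in minutes.

    'ry' denotes RNA; the third letter encodes the 10 min interval of the
    100 min E. coli genetic map (assumed: a = 0-10, b = 10-20, ..., j = 90-100).
    """
    if not 0 <= minute < 100:
        raise ValueError("map position must be in the range [0, 100)")
    interval_letter = chr(ord("a") + int(minute // 10))
    return "ry" + interval_letter


# Positions taken from Table 2.
print(srna_gene_prefix(38))  # ryd (rydB)
print(srna_gene_prefix(92))  # ryj (ryjA)
```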

We noted one instance of overlap in sequence between our new sRNAs. The conserved region within #43 is highly homologous to a duplicated region within #55, as well as to a fourth region of the chromosome within a more poorly conserved Ig (#61 in Table 1). This repeated region was previously denoted the QUAD repeat and suggested to encode sRNAs (Rudd, K. E. 1999 Res Microbiol 150:653-664). Each of the QUAD repeats contains a short stretch homologous to boxC, a repeat element of unknown function present in 50 copies or more within the genome of E. coli (Bachellier, S. et al. 1996 Repeated Sequences In: Escherichia coli and Salmonella: Cellular and Molecular Biology eds. F. C. Neidhardt, et al. pp. 2012-2040 American Society for Microbiology, Washington, D.C.). Rudd also has detected transcripts from the QUAD regions. To determine which of the four QUAD genes was being expressed, we designed oligonucleotide probes unique for each of the four repeats. These oligonucleotide probes demonstrated expression for three of the four QUAD genes (#55-I, #55-II, and #61); furthermore, each gave two RNA bands (FIG. 3; Table 2). No signal was detected for the fourth repeat (#43). The #41 Ig region encodes another pair of repeats, PAIR2 (Rudd, K. E. 1999 Res Microbiol 150:653-664), and we observed two RNA species, suggesting that each of the repeats may be transcriptionally active. Finally, another repeat region noted by Rudd, PAIR3, is encoded by the #22 Ig region.

Many sRNAs Bind Hfq and Modulate rpoS Expression

Hfq is a small, highly abundant RNA-binding protein first identified for its role in replication of the RNA phage Qβ (Franze de Fernandez, M. et al. 1968 Nature 219:588-590; reviewed in Blumenthal, T. and Carmichael, G. G. 1979 Annu Rev Biochem 48:525-548). Recently, Hfq has been shown to be involved in a number of RNA transactions in the cell, including translational regulation (rpoS), mRNA polyadenylation, and mRNA stability (ompA, mutS, and miaA) (Muffler, A. et al. 1996 Genes & Dev 10:1143-1151; Tsui, H.-C. T. et al. 1997 J Bacteriol 179:7476-7487; Vytvytska, O. et al. 1998 PNAS USA 95:14118-14123; Hajnsdorf, E. and Regnier, P. 2000 PNAS USA 97:1501-1505; Vytvytska, O. et al. 2000 Genes & Dev 14:1109-1118). Three of the known E. coli sRNAs regulate rpoS expression: DsrA RNA and RprA RNA positively regulate rpoS translation, whereas OxyS RNA represses its translation. In all three cases the Hfq protein is required for regulation (Zhang, A. et al. 1998 EMBO J 17:6061-6068; Majdalani, N. et al. 2001 Mol Microbiol 39:1382-1394; Sledjeski, D. D. et al. 2001 J Bacteriol 183:1997-2005), and binding studies have revealed a direct interaction between Hfq and the OxyS and DsrA RNAs (Zhang, A. et al. 1998 EMBO J 17:6061-6068; Sledjeski, D. D. et al. 2001 J Bacteriol 183:1997-2005).

Given the interaction of the Hfq protein with at least three of the known sRNAs, we asked how many of the newly discovered sRNAs are bound by this protein. Hfq-specific antisera were used to immunoprecipitate Hfq-associated RNAs from extracts of cells grown under the conditions used for the Northern analysis. Total immunoprecipitated RNA was examined using two methods. First, RNA was 3′-end labeled and selected RNAs were visualized directly on polyacrylamide gels. Under each growth condition, several RNA species co-immunoprecipitated with Hfq-specific sera but not with preimmune sera, which indicates that many sRNAs interact with Hfq (FIG. 4A). Second, selected RNAs were examined using Northern hybridization to determine whether other known sRNAs and any of our newly discovered sRNAs interact with Hfq. For each sRNA, Hfq binding was examined under growth conditions where the sRNA was most abundant (FIG. 4B; Table 2). sRNAs present in samples using the Hfq antisera but not preimmune sera were concluded to interact with Hfq. Comparison of levels of a selected sRNA relative to the total amount of that sRNA in the extract revealed that many of the sRNAs bound Hfq quite efficiently (>30% bound) (#14, #24, #25, #26, #31, #41, #52-II, Spot42 RNA, and RprA RNA), but other sRNAs bound Hfq less efficiently (<10% bound) (#9, #17, and #52-I), or not at all (#27, #38, #40, 6S RNA, 5S RNA, and tmRNA) (FIG. 4; Table 2).

As mentioned above, at least three of the known sRNAs that interact with Hfq also regulate translation of rpoS, the stationary phase sigma factor. Because many of the new sRNAs also interact with Hfq, we examined whether these new sRNAs affect rpoS expression. Plasmids carrying the Ig regions encoding either control sRNAs (pRS-DsrA and pRS-RprA) or many of our novel sRNAs were introduced into an MG1655 Δlac derivative carrying an rpoS-lacZ translational fusion. We then compared expression of the rpoS-lacZ fusion in these cells to cells carrying the control vector by measuring β-galactosidase activity at stationary phase in LB or M63-glucose medium (Table 2). As expected, overproduction of either DsrA RNA or RprA RNA increased rpoS-lacZ expression significantly (Table 2 legend). A number of plasmids (pRS-#24, pRS-#31) led to increased rpoS-lacZ expression, whereas others (pRS-#12, pRS-#14, and pRS-#25) led to decreased expression. These results indicate that the corresponding sRNAs may directly regulate rpoS expression or indirectly affect rpoS expression by altering Hfq activity, possibly by competition. Intriguingly, there is not a complete correlation between Hfq binding and altered rpoS-lacZ expression in these studies.

As another strategy in defining possible functions for the sRNAs, we screened strains carrying the multicopy plasmids for effects on growth in LB medium at various temperatures as well as growth in minimal medium containing a number of different carbon sources. pRS-#25 renders cells unable to grow on succinate, in agreement with predictions for #25 RNA interaction with sdh mRNA (discussed below). We were unable to isolate plasmids carrying the #27 Ig region without mutations, indicating that overproduction of this small RNA may interfere with growth. No other growth phenotypes were observed. A caveat for the interpretation of results with the multicopy plasmids is that they contain the full intergenic region; therefore, we cannot rule out effects of sequences outside the sRNA genes but within the intergenic regions.

In summary, a multifaceted search strategy to predict sRNA genes was validated by our discovery of 17 novel sRNAs. Northern analysis determined that 44 of 60 candidate regions express RNA transcripts, some of them expressing more than one RNA species. Of these transcripts, 24 were concluded to be 5′ untranslated leaders for mRNAs of flanking genes, and another six are predicted to encode new, short ORFs (Tables 1 and 2). The 17 transcripts believed to be novel, functional sRNAs range from 45 nt to 320 nt in length and vary significantly in expression levels and expression profiles under different growth conditions. More than half of the new sRNAs were found to interact with the RNA-binding protein Hfq, indicating that Hfq binding may be a defining characteristic of a family of prokaryotic sRNAs.

Evaluation of Selection Criteria

Three general approaches for predicting sRNA genes were evaluated in this work. In the primary approach, Ig regions were scored for degree and length of conservation between closely related bacterial species followed by examination of sequence features. This approach proved to be very productive in identifying Ig regions encoding novel sRNAs in E. coli; more than 30% of the candidates selected primarily on the basis of their conservation proved to encode novel small transcripts. The availability of nearly completed genome sequences for Salmonella and Klebsiella made this approach possible. Any organism for which the genome sequences of closely related species are known can be analyzed in this way. Comparative genomics of this sort has been used before to search for regulatory sites (for review, see Gelfand, M. S. 1999 Res Microbiol 150:755-771), but has not been employed previously to find sRNAs.

Although we found the conservation-based approach to be the most productive in identifying sRNA genes, we note a number of limitations to its use. A high level of conservation is not sufficient to indicate the presence of an sRNA gene. Many of the most highly conserved regions, not unexpectedly, were consistent with regulatory and leader sequences for flanking genes. We also did not analyze any Ig regions where conservation was attributable to sources other than an sRNA. For example, potential sRNAs processed from mRNAs, or any sRNAs encoded by the antisense strand of ORFs or leaders, may have been missed in our approach. We made the assumption that Ig regions must be ≥180 nt to encode an sRNA of ≥60 nt, a 50-60-nt promoter and regulatory region to control expression of the sRNA, as well as regulatory regions for flanking genes. Any sRNA genes in smaller Ig regions would have been overlooked. We also excluded the highly conserved tRNA and rRNA operons from our consideration because of their complexity. It is certainly possible that sRNA genes may be associated with these other RNA genes. In fact, sRNA genes have been predicted to be encoded in at least one tRNA operon. In addition, conservation need not be a property of all sRNAs. We expect sRNAs that play a role in modulating cellular metabolism to be well conserved, as is the case for the previously identified sRNAs. Nevertheless, sRNAs may be encoded within or act upon regions for which there is no homology between E. coli, Klebsiella, and Salmonella (e.g., in cryptic prophages and pathogenicity islands), and they would be missed by this approach. Only one of 24 Ig regions within the e14, CP4-54, or CP4-6 prophages showed conservation. A few of these Ig regions showed evidence of transcription by microarray analysis, and RNAs have been implicated in immunity regulation in phage P4 (Ghisotti, D. et al. 1992 Mol Microbiol 6:3405-3413), which is related to the prophages CP4-54 and CP4-6. Despite the limitations listed above, we believe the use of conservation provides a relatively quick identification of the majority of sRNAs.

An alternative genomic sequence-based strategy for identifying sRNAs would be to search for orphan promoter and terminator elements as well as other potential RNA structural elements. Potential promoter elements were generally too abundant to be useful predictors without other information on their expected location and orientation. We found sequences predicted to be rho-independent terminators a more useful indicator of sRNAs; such sequences were clearly present for 13/17 of the sRNAs and 3/6 of the new mRNAs. In a number of cases, it appears that the sRNAs share a terminator with a convergent gene for an ORF. In other cases, either no terminator was detected or it appeared to be in a neighboring ORF. A search using promoter and terminator sequences as the requirements for identifying sRNAs might therefore have found two-thirds of the sRNAs described here. Phage integration target sequences also could be scanned for nearby sRNA genes. Many phage att sites overlap tRNAs (reviewed in Campbell, A. M. 1992 J Bacteriol 174:7495-7499), and ssrA, encoding the tmRNA, has a 3′ structure like a tRNA and overlaps the att site of a cryptic prophage (Kirby, J. E. et al. 1994 J Bacteriol 176:2068-2081). In this work, we found that the 3′ end and terminator of #14 overlap the previously mapped phage P2 att site (Barreiro, V. and Haggard-Ljungquist, E. 1992 J Bacteriol 174:4086-4093). #14 sRNA does not obviously resemble a tRNA, suggesting that the overlap between phage att sites and RNA genes extends beyond tRNAs and related molecules and may be common to additional sRNAs.

Our second approach, high-density oligonucleotide probe array expression analysis, proved to be more useful in confirming the presence of sRNA genes first found by the conservation approach than in identifying new sRNA genes de novo. Further consideration of the location of microarray signal compared to flanking genes as well as analysis of microarray signals after a variety of growth conditions should expand the ability to detect sRNAs in this manner. Under a single growth condition, signal consistent with the RNA identified by Northern analysis was detected for 5/15 of the Ig regions proven to encode new sRNAs and for 4/6 of the new mRNAs. Thus, a similar analysis of microarray data in nonconserved genomic regions might help in the identification of sRNAs missed by the conservation-based approaches. We predict that sRNAs from any organism expressed at reasonably high levels under normal growth conditions will be detected by microarrays that interrogate the entire genome, inclusive of noncoding regions.

One clear limitation in detecting sRNAs with microarray or Northern analyses is the fact that some sRNAs may be expressed only under limited growth conditions or at extremely low levels. We chose three growth conditions to scan our samples. While most of the previously known sRNAs were seen under these conditions, OxyS RNA, which is induced by oxidative stress, was not detectable. For a few of our candidates in which no RNA was detected, it is possible that an sRNA is encoded but is not expressed sufficiently to be detected under any of our growth conditions. Another possible limitation of hybridization-based approaches is that highly structured sRNAs may be refractory to probe generation. sRNA transcripts may not remain quantitatively represented after the fragmentation used in the direct labeling approach here. cDNA labeling also may underrepresent sRNAs because they are a small target for the oligonucleotide primers, and secondary structure can interfere with efficiency of extension.

As our third approach, sRNAs were selected on the basis of their ability to bind to the general RNA-binding protein, Hfq. Northern analysis revealed that many of our novel sRNAs interact with Hfq. In preliminary microarray analysis of Hfq-selected RNAs to look for additional unknown sRNAs, DsrA RNA, DicF RNA, Spot42 RNA, #14, #24, #25, #31, #41, and #52-II were detected among those RNAs with the largest difference in levels between Hfq-specific sera and preimmune sera. This preliminary experiment suggests that microarray analysis of selected RNAs will be very valuable on a genome-wide basis. Interestingly, a large number of genes with leaders and a number of RNAs for operons were found to co-immunoprecipitate with Hfq (including the known Hfq target nlpD-rpoS mRNA (Brown, L. and Elliott, T. 1996 J Bacteriol 178:3763-3770)). It seems likely that the subset of sRNAs binding a common protein will represent a subset in terms of function; the sRNAs of known function associated with Hfq in our experiments appear to be those involved in regulating mRNA translation and stability. Other sRNAs have been shown to interact with specific prokaryotic RNA-binding proteins, for example, tmRNA with SmpB (Karzai, A. W. et al. 1999 EMBO J 18:3793-3799), and the possibility of other sRNAs interacting with these proteins or other general sRNA-binding proteins should be tested. This approach is adaptable to all organisms, and, in fact, binding to Sm and Fibrillarin proteins has been the basis for identification of several sRNAs in eukaryotic cells (Montzka, K. A. and Steitz, J. A. 1988 PNAS USA 85:8885-8889; Tyc, K. and Steitz, J. A. 1989 EMBO J 8:3113-3119).

All the criteria we used to identify sRNAs also will detect short genes encoding new small peptides, and we have found six conserved short ORFs. Although our approach was intended to develop methods to identify non-translated genes within the genome, short ORFs also are missing from annotated genome sequences. The combination of a requirement for conservation and/or transcription with sequence predictions for ORFs should add significantly to our ability to recognize short ORFs. Small polypeptides have been shown to have a variety of interesting cellular roles. We expect that the short ORFs we have found are involved in signaling pathways, akin to those of B. subtilis peptides that enter the medium and carry out cell-cell signaling (reviewed in Lazazzera, B. A. 2000 Curr Opin Microbiol 3:177-182).

Characteristics and Functions of New sRNAs

The current work serves as a blueprint for the prediction, detection, and characterization of a large group of novel sRNAs. We have definitive information on characteristics that bear on the cellular roles of these new sRNAs. Several known sRNAs that bind the Hfq protein act via base pairing to target mRNAs. The finding that a number of our new sRNAs bind Hfq indicates a similar mechanism of action for this subset of sRNAs. We searched the E. coli genome for possible complementary target sequences and examined phenotypes associated with multicopy plasmids containing new sRNA genes. Intriguingly, #25, an sRNA preferentially expressed in minimal medium, has extended complementarity to a sequence near the start of sdhD, the second gene of the succinate dehydrogenase operon. When the #25 Ig region is present on a multicopy plasmid, it interferes with growth on succinate minimal medium (Table 2), consistent with #25 sRNA acting as an antisense RNA for sdhD. Complementarity to many target mRNAs was found for a number of other novel sRNAs, confirming the validity of this analysis.

As outlined in the evaluation of each of our approaches, we do not expect that our searches have been exhaustive. sRNAs also have been detected by others using a variety of approaches. The sRNA encoded by #38 was independently identified as a regulatory RNA (CsrC RNA; T. Romeo, pers. comm.), and others have found additional sRNAs using variations of the approaches used here (Argaman, L. et al. 2001. Curr. Biol. in press). Nevertheless, we think it unlikely that there are many more than 50 sRNAs encoded by the E. coli chromosome and by closely related bacteria. We expect such sRNAs to be present and playing important regulatory roles in all organisms. Using the approaches described here, it is feasible to search all sequenced organisms for these important regulatory molecules. We anticipate that study of the expanded list of sRNAs in E. coli will allow a more complete understanding of the range of roles played by regulatory sRNAs.

EXAMPLE 1 Computer Searches

Ig regions are defined here as sequences between two neighboring ORFs. We compared Ig regions of ≥180 nt against the NCBI Unfinished Microbial Genomes database (Worldwide web at ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html) using the BLAST program (Altschul, S. F. et al. 1990 J Mol Biol 215:403-410). Salmonella enteritidis sequence data were from the University of Illinois, Department of Microbiology (Worldwide web at salmonella.org). Salmonella typhi and Yersinia pestis sequence data were from the Sanger Centre (Worldwide web at sanger.ac.uk/Projects/S_typhi/ and sanger.ac.uk/Projects/Y_pestis/). Salmonella typhimurium, Salmonella paratyphi, and Klebsiella pneumoniae sequences were from the Washington University Genome Sequencing Center.
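The definition of an Ig region given above, together with the ≥180 nt cutoff, can be illustrated with a minimal sketch. It assumes ORF coordinates are supplied as 1-based inclusive (start, end) pairs on a linear sequence; the function name, the coordinate convention, and the decision to ignore strand and the circular chromosome junction are illustrative assumptions, not part of the described procedure.

```python
from typing import List, Tuple

def intergenic_regions(orfs: List[Tuple[int, int]],
                       min_len: int = 180) -> List[Tuple[int, int]]:
    """Return gaps between neighboring ORFs that are at least min_len nt long.

    orfs: hypothetical list of (start, end) pairs, 1-based inclusive.
    The wrap-around gap of a circular chromosome is not considered
    (a simplifying assumption of this sketch).
    """
    regions = []
    ordered = sorted(orfs)
    if not ordered:
        return regions
    prev_end = ordered[0][1]
    for start, end in ordered[1:]:
        gap_start, gap_end = prev_end + 1, start - 1
        # Overlapping or abutting ORFs give a non-positive gap length.
        if gap_end - gap_start + 1 >= min_len:
            regions.append((gap_start, gap_end))
        # Track the furthest ORF end seen so far to handle nested ORFs.
        prev_end = max(prev_end, end)
    return regions
```

Each region returned this way would then be the query for a BLAST comparison against the related genomes; that step is not sketched here.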

Each Ig region was rated based on the best match to Salmonella or K. pneumoniae species. Ig regions containing previously identified sRNAs were rated 5 (each of them met the criteria to be rated 4). Ig regions were rated 4 if the raw BLAST score was >200 (hatched bars in FIG. 1) or 80-200 (double-diagonal bars in FIG. 1) extending for >80 nt; 3 if the raw BLAST score was 80-200 (double-diagonal bar) extending for 60-80 nt; 2 if the raw BLAST score was 50-80 (diagonal bar) extending for >65 nt; and 1 if the raw BLAST score was <50 (diagonal-dash, solid or none) or <65 nt. The location of the longest conserved section(s) within each Ig and the number of matches to the NCBI Unfinished Microbial database were recorded. Note that the computer searches were done from May 2000 to December 2000; more sequences are expected to match as the database continues to expand. The identity and orientation of genes flanking each Ig region were determined from the Colibri database (using http:// for genolist.pasteur.fr/Colibri). Ig regions that the Colibri database predicted to be <180 nt in length and Ig regions containing tRNA and/or rRNAs were rated 0 and removed from further consideration.
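The rating scheme above can be written out as a short decision function, offered only as a sketch. The text does not say how boundary values or uncovered combinations were handled, so the following choices are assumptions of this illustration: a score of exactly 80 or 200 is placed in the 80-200 band, higher ratings are tested first, and any case not covered by the stated rules (for example, a score of 80-200 extending for fewer than 60 nt) falls through to a rating of 1. The function and parameter names are hypothetical.

```python
def rate_ig_region(ig_length_nt: int,
                   raw_blast_score: float,
                   match_length_nt: int,
                   has_known_srna: bool = False,
                   has_trna_or_rrna: bool = False) -> int:
    """Assign a 0-5 conservation rating to one Ig region (illustrative sketch)."""
    # Rated 0 and removed: short Ig regions and those with tRNA/rRNA genes.
    if ig_length_nt < 180 or has_trna_or_rrna:
        return 0
    # Rated 5: Ig regions containing previously identified sRNAs.
    if has_known_srna:
        return 5
    if raw_blast_score > 200:
        return 4
    if 80 <= raw_blast_score <= 200:
        if match_length_nt > 80:
            return 4
        if 60 <= match_length_nt <= 80:
            return 3
        return 1  # assumption: shorter matches in this band drop to 1
    if 50 <= raw_blast_score < 80 and match_length_nt > 65:
        return 2
    return 1
```

For example, under these assumptions a 300-nt Ig region whose best match has a raw score of 150 over 70 nt would be rated 3.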

Strains and Plasmids

Strains were grown at 37° C. in Luria-Bertani (LB) medium or M63 minimal medium supplemented with 0.2% glucose and 0.002% vitamin B1 (Silhavy, T. J. et al. 1984 Experiments with gene fusions Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.) except for phenotype testing of strains carrying multicopy plasmids as described below. Ampicillin (50 μg/ml) was added where appropriate. E. coli MG1655 was the parent for all strains used in this study. MG1655 Δlac (DJ480, obtained from D. Jin, NCI) was lysogenized with a λ phage carrying an rpoS-lacZ translational fusion (Sledjeski, D. D. et al. 1996 EMBO J 15:3993-4000) to create strain SG30013.

To generate clones containing the Ig region of each candidate (pCR-#N where N refers to candidate number; see Table 1), Ig regions were amplified by PCR from an MG1655 colony and cloned into the pCRII vector using the TOPO TA cloning kit (Invitrogen). Oligonucleotides were designed so the entire conserved region and in most cases the full Ig region was included. In a few cases, repeated sequences or other irregularities required a reduction in the Ig regions cloned. See Table 3 for a list of all oligonucleotides used in this paper. Ig regions encoding sRNAs also were cloned into multicopy expression vectors (pRS-#N) in which each Ig region is flanked by several vector-encoded transcription terminators. To generate pRS-#N plasmids, pCR-#N plasmids were digested with BamHI and XhoI and the Ig-containing fragments were cloned into the BamHI and SalI sites of pRS1553 (Pepe, C. M. et al. 1997 J Mol Biol 270:14-25), replacing the lacZ-α peptide. To construct pBS-spot42, the Spot42-containing fragment was amplified by PCR from K12 genomic DNA, digested with EcoRI and BamHI and cloned into corresponding sites in pBluescript II SK⁺ (Stratagene). All DNA manipulations were carried out using standard procedures. All clones were confirmed by sequencing.

TABLE 3 Oligonucleotides SEQ Oligo ID Candidate Name Sequence NO NumberKW-39 GCGCCTCGTTATCATCCAAAATACG  32  #1 KW-40 GTCGCCCAGCCAATGCTTTCAGTCG 33 KW-41 ATTGATCGCACACCTGACAGCTGCC  34  #2 KW-42GTTGTCACCCTGGACCTGGTCGTAC  35 KW-43 TGACCGCGATTTGCACAAAATGC  36  #3KW-44 ACTCTTAAATTTCCTATCAAAACTCGC  37 KW-45 GGTATTTTCAGAGATTATGAATTGCCG 38  #4 KW-46 TCACCTCTCCTTCGAGCGCTACTGG  39 KW-47AATGCTCTCCTGATAATGTTAAACTT  40  #5 KW-48 GGTTAGCTCCGAAGCAAAAGCCGGAT  41KW-49 TAATTCCTTTCAAATGAAACGGAGC  42  #6 KW-50 GGACTCCCTCATTATAATTACTGG 43 KW-51 CTCCTTAAACAAGGACATTAGTCTACG  44  #7 KW-52ATTCACCTTACCTAATTTGATTCTTCC  45 KW-123 CCATCGCTTGACGTTGCATTCACCTGC  46 #8(probes) KW-124 GTCGGCGTCGTACGAATCAATTGTGC  47 KW-125GCACAATTGATTCGTACGACGCCGAC  48 KW-55 TAAGGATAATATTGCAGATCGTAAG  49  #9KW-56 ATCATCAAACAGCAACTTGCCC  50 KW-57 TGTCCTTCTCCTGCAAGAGAATTATT  51#10 KW-58 GCTAATAATAATGTCTTTTTCGCTCC  52 FR-100GCTTTTGTGAATTAATTTGTATATCGAAGCG  53 #11 FR-101TATTAATACCCTCTAGATTGAGTTAATC  54 FR-102 CGATTTACCTCACTTCATCGCTTTCAG  55#12 FR-103 TGATCCTGACTTAATGCCGCAAGTTC  56 FR-104GCTTATCTCCGGCACTCTCAGTGGCTTAGCTCT  57 (probe) TGAAGG FR-105TTGCTCACATCTCACTTTAATCGTGCTC  58 #13 FR-106ATATTCCACCAGCTATTTGTTAGTGAATAAAAG  59 G FR-107TGATTAATTTCGATTATTTTTCCCGGATGG  60 #14 FR-108ATTAGAAACAGGAAGCCCCTCAGTCGAG  61 FR-109 TTATTTTCCCCGGAAGCACATTCACTTCAC 62 #15 FR-110 TGATCTATTGCACAACGAGGAAGC  63 FR-111TGCTTACTCATCAAAAGTAGCGCCAGATTC  64 #16 FR-112TAATCGACGGACGATAGATAATTCCTG  65 FR-113 CCAATGTGTCGCCTTTTTCAACTTTCCG  66#17 FR-114 CGATTTATGAGAATAAATACTCATTTAAGGGTG  67 FR-115AAATCCGACTTTAGTTACAACATAC  68 #18 FR-116 GACCAGACCTTCTTGATGATGGGCAC  69KW-69 CGACCTCAATTCCACGGGATCTGG  70 #19 KW-70 ATTTAGCTGTAGTAATCACTCGCCG 71 KW-71 GGTCTCCTTAGCGCCTTATTGCG  72 #20 KW-72CGCCCACATGCTGTTCTTATTATTCCC  73 KW-73 TTTATGACACCTGCCACTGCCGTC  74 #21KW-74 CTGTCAAGTTATCTGTTTGTTAAGTCAAGC  75 KW-126GCTGTGAAGCACCTGCGTTGCTCATG  76 #22 KW-127 GCTGTGAAACACCTGCATTTACGGCCACGG 77 (probes) KW-128 CCGTGGCCGTAAATGCAGGTGTTTCACAGC  78 KW-77CCTTTCGCAATTGACTGAAACAC  79 #24 
KW-78 GGCTAGACCGGGGTGCGCG  80 KW-79AAGGTGGTTATTTACACCTTAGCG  81 #25 KW-80 GTCCTCTTTGGGGTAAATGTC  82 KW-81AATGCTCCGGTTTCATGTCATC  83 #26 KW-82 TAGTTCCTTCTCACCCGGAG  84 FR-117CACAAGGGCGCTTTAGTTTGTTTTCCG  85 #27 FR-118 ATCCCCTGAGAGTTTAATTTTCGTCAAG 86 KW-85 TAATTCGTCGTAATTCGTCCTCC  87 #28 KW-86 CTCTGCCTTCCTGTTTTTGTTGTG 88 FR-119 AAACGCATTTGCAACTGTCGGCGCTTTTCC  89 #29 FR-120CTTGTTACCTCAAAAAATCACAGTGCTCG  90 FR-121 GCAGTCGGTGATGCTGGATTTGCCCTG  91#30 FR-122 GTTTTTTTACGGGTAAGCCGCAACGACCATTG  92 FR-123TAGTAGATAAGTTTTAGATAAC  93 #31 FR-124 TAAAACTGAAGTTGCCCTGAAAATG  94FR-125 TGATGAGTGGTTCTGCAAGAGG  95 #32 FR-126 TAAAAGACAGATTACCTGGCCTG  96FR-127 CGGACTACCTCAAAATAAAGCTTTATATACG  97 #33 FR-128GTCATGATACCTTGATTAAAAAACAAACAGC  98 FR-129 GGCTATAATGCGCACATAACCTCTTG 99 #34 FR-130 AATCTTTTCTTATTTTTTGGCTAACGAATAGCC 100 FR-131GTCCAACTTTTTGGGGTCAGTACAAACTTTG 101 #35 FR-132TAATAACGCCGTTATTAAATAGCCTGCC 102 FR-133 TAAGCAACGTCTGCTTACTGCCCCTC 103#36 FR-134 GTGATGGCTTCTGATAAAGATAAATTTATAGCC 104 FR-135TAACAGGCTAAGAGGGGC 105 #37 FR-136 ATTGCCACTCTTCTTGATCAAATAACCG 106FR-137 AATGCGTCTGTTGATAATTCAAATTAGTC 107 #38 FR-138TAGCCGTTTTATTCAGTATAGATTTGCG 108 KW-89 GTTCGTCGGTAACCCGTTTCAGC 109 #39KW-90 ATGGCTTAAAGAGAGGTGCC 110 KW-91 CGTACTTTAAAGGGAGAATGAC 111 #40KW-92 GTGCTTCCTCATTATGGTGACG 112 KW-93 GAATGGAGGGAGATTACACG 113 #41KW-94 CCTTAGTGGGTAAACGCTTAC 114 KW-95 CTTTCAGGCAGCTAAGGAAAG 115 #42KW-96 CAATATGTATTATTGATTGAGTAAACGGG 116 KW-97 CCTCTTCCAGGAATAATCCC 117#43 KW-98 CGGAAAGCGGTTCACAGATC 118 KW-132 CTCGTAAGTTTCGCAGCTTATTA 119#43 (probe) KW-99 TGAAATTCCTGTCCGACAGG 120 #44 kW-100GCACTACCGCAATGTTATTGC 121 KW-101 GCTTACCCAATAAATAGTTACACG 122 #45 KW-102TAAAACCTGTCACAAATCACAAA 123 KW-103 GTGGCCTGCTTCAAACTTTCG 124 #46 KW-104GTAAAGTCTAGCCTGGCGGTTCG 125 FR-139 TAATTCTGGTACGCCTGGCAGATATTTTGCC 126#47 FR-140 ATCAACCTCAAAAGGGAAATCGGG 127 KW-105 TAACTTGTTGTAAGCCGGATCGG128 #48 KW-106 TGAAGCATCTATCGCCGGTTGCG 129 KW-107GATTAGAAATCCTTTTGAAAGCGCATTG 130 #49 KW-108 CTTATTGGGCACCGCAATGG 131KW-109 
CGAACACAATAAAGATTTAATTCAGCC 132 #50 KW-110 CTGATGCTACTGTGTCAACG133 KW-111 AATAATCAGACATAGCTTAGGC 134 #51 KW-112 GCCGTGATGGTTTTCGCGTTC135 KW-113 TATTTTCCTCCCGCGCTAAAG 136 #52 KW-114 TTCAGCTGATGACCACCACGCTT137 KW-115 GAGTTGTCAGAGCAGGATGATTC 138 #53 KW-116 TATCTGCGCTTATCCTTTATGG139 KW-117 CCTTTACGGTGATAACCGTCGCG 140 #54 KW-118CTGACAAGCCTCTCATTCTCTTGTC 141 KW-119 GAGAATTATCGAGGTCCGGTATC 142 #55KW-120 CTACGCGTTAGCGATAGACTGC 143 FR-141 AGGCTTACTAAGAACACCAGGGGGAGGGGAA144 probe for 55-I FR-142 AGTCATAAGCTTCCCCGCTTACTAAGACTA 145probe for 55-II KW-121 CCTCAAATCGGCCATAATAACC 146 56 KW-122TAAACACCGTCGTCAGAAATGC 147 FR-143 TAGACTTTTATCCACTTTATTGCTG 148 #57FR-144 GTGTGCCTTTCGGCGATATGGCGTG 149 FR-145 CCTTTACGTGGGCGGTGATTTTGTC150 #58 FR-146 TAGCTTTGCTCCTGGATGTTTGCC 151 FR-147GCTGTAATTTATTCAGCGTTTGTACATACG 152 #59 (probe) FR-148TCAGTCAACTCGCTGCGGCGTGTTAC 153 #60 FR-149 CTTATTGTTGCTTAGTTAGGGTAGTCAC154 KW-131 CAGTCAGTCTCAGGGGAGGAGCAATC 155 #61 (probe) KW-59TGAATGCACAATAAAAAAATCCCGACCCTG 156 For DsrA KW-60AGTCGCGCAGTACTCCTCTTACCAG 157 Ig region KW-63 TAATTTCTCATCAGGCGGCTCTGC158 for RprA KW-64 TAACATTATCAGCCTGCTGACGGC 159 Ig region sp42-5′-1GGCCGAATTCGTAGGGTACAGAGGTAAG 160 for cloning sp42-3′-1GGCCGGATCCGTCATTACTGACTGGGGCGG 161 pBSspot42

RNA Analysis

RNA for Northern analysis was isolated directly from ~3×10⁹ cells in exponential growth (OD₆₀₀=0.2-0.4) or stationary phase (overnight growth) as described previously (Wassarman, K. M. and Storz, G. 2000 Cell 101:613-623). Five-μg RNA samples were fractionated on 10% polyacrylamide urea gels and transferred to Hybond N membrane as described previously (Wassarman, K. M. and Storz, G. 2000 Cell 101:613-623). For Northern analysis of candidate regions, double-stranded DNA probes were generated by PCR from a colony of MG1655 cells or from the pCR-#N plasmids with oligonucleotides used for cloning the pCR-#N plasmids. PCR amplification was done with 52° C. annealing for 30 cycles in 1× PCR buffer (1 mM each dATP, dGTP, and dTTP; 2.5 μM dCTP; 100 μCi [α³²P] dCTP; 10 ng plasmid; 1 unit Taq polymerase) (Perkin Elmer). Probes were purified over G-50 microspin columns (Amersham Pharmacia Biotech) prior to use. Northern membranes were prehybridized in a 1:1 mixture of Hybrisol I and Hybrisol II (Intergen) at 40° C. DNA probes with 500 μg sonicated salmon sperm DNA were heated for 5 min to 95° C., added to prehybridization solution, and membranes were hybridized overnight at 40° C. Membranes were washed by rinsing twice with 4×SSC/0.1% SDS at room temperature followed by three washes with 2×SSC/0.1% SDS at 40° C. Northern blot analysis using RNA probes was done as described previously (Wassarman, K. M. and Steitz, J. A. 1992 Mol Cell Biol 12:1276-1285). RNA probes were generated by in vitro transcription according to manufacturer protocols (Roche Molecular Biochemicals) from pCR-#N plasmids linearized with EcoRV or HindIII using SP6 RNA polymerase or T7 RNA polymerase, respectively; pBS-6S (pGS0112; Wassarman, K. M. and Storz, G. 2000 Cell 101:613-623) or pBS-spot42 were linearized with EcoRI using T3 RNA polymerase; pGEM-5S (pG5019; Altuvia, S. et al. 1997 Cell 90:43-53) or pGEM-10Sa (Altuvia, S. et al. 1997 Cell 90:43-53) were linearized with EcoRI using SP6 RNA polymerase. Oligonucleotide probes were labeled by polynucleotide kinase according to manufacturer protocols (New England Biolabs) using [γ³²P]ATP (>5000 Ci/mmole; Amersham Pharmacia Biotech). For oligonucleotide probes, Northern membranes were prehybridized in Ultrahyb (Ambion) at 40° C. followed by addition of labeled oligonucleotide probe and hybridization overnight at 40° C. Membranes were washed twice with 2×SSC/0.1% SDS at room temperature followed by two washes with 0.1×SSC/0.1% SDS at 40° C. for 15 minutes each.

Immunoprecipitation

Immunoprecipitations were carried out using extracts from cells in exponential growth (OD₆₀₀=0.2-0.4) or stationary phase (overnight growth) as described previously (Wassarman, K. M. and Storz, G. 2000 Cell 101:613-623), using rabbit antisera against the Hfq protein or preimmune serum. After immunoprecipitation, RNA was isolated from Protein A Sepharose-antibody pellets by extraction with phenol:chloroform:isoamyl alcohol (50:50:1) followed by ethanol precipitation. RNA was examined on gels directly after 3′ end labeling or analyzed by Northern hybridization after fractionation on 10% polyacrylamide urea gels as described previously (Wassarman, K. M. and Storz, G. 2000 Cell 101:613-623).

rpoS-lacZ Expression

Effects on rpoS-lacZ expression by multicopy plasmids containing the novel sRNAs were determined from a single colony of SG30013 transformed with pRS-#N, grown for 18 h in 5 ml of LB-ampicillin medium or M63-ampicillin medium supplemented with 0.2% glucose at 37° C. β-galactosidase activity in the culture was assayed as described previously (Zhou, Y.-N. and Gottesman, S. 1998 J Bacteriol 180:1154-1158). The numbers provided in Table 2 were calculated as the ratio between pRS-#N and the pRS1553 vector control.

Phenotype Testing

To test carbon source utilization or temperature sensitivity associated with the multicopy plasmids containing the novel sRNAs, a single colony of MG1655 transformed with a given pRS-#N was grown for 6 hours in 5 ml LB-ampicillin medium at 37° C. Then 10 μl of serial dilutions (10⁻², 10⁻⁴, and 10⁻⁶) were spotted on M63-ampicillin plates containing 0.2% of the carbon source being tested (glucose, arabinose, lactose, glycerol, ribose, or succinate) and grown at 37° C.; or on LB plates incubated at room temperature or 42° C. Plates were analyzed after both 1 and 2 days. Failure to grow in Table 2 indicates an efficiency of plating of <10⁻³.

Microarray Analysis

RNA for microarray analysis was isolated using the MasterPure RNA purification kit according to the manufacturer protocols (Epicentre) from MG1655 cells grown to OD₆₀₀=0.8 in LB medium at 37° C. DNA was removed from RNA samples by digestion with DNase I for 30 min at 37° C. Probes for microarray analysis were generated by one of two methods: direct labeling of enriched mRNA or generation of labeled cDNA.

To generate direct labeled RNA probes, mRNA enrichment and labeling were done as described in the Affymetrix expression handbook (Affymetrix). Oligonucleotide primers complementary to 16S and 23S rRNA were annealed to total RNA followed by reverse transcription to synthesize cDNA strands complementary to 16S and 23S rRNA species. 16S and 23S rRNAs were degraded with RNase H followed by DNase I treatment to remove cDNA and oligonucleotides. Enriched RNA was fragmented for 30 min at 95° C. in 1× T4 polynucleotide kinase buffer (New England Biolabs), followed by labeling with γ-S-ATP and T4 polynucleotide kinase and ethanol precipitation. The biotin label was introduced by resuspending RNA in 96 μl of 30 mM MOPS (pH 7.5), adding 4 μl of a 50 mM iodoacetylbiotin solution, and incubating at 37° C. for 1 hr. RNA was purified using the RNA/DNA Mini Kit according to manufacturer protocols (QIAGEN).

To generate cDNA probes, 5 μg of total RNA was reverse transcribed using the Superscript II system for first strand cDNA synthesis (Life Technologies) and 500 ng of random hexamers. RNA and primers were heated to 70° C. and cooled to 25° C.; reaction buffer was then added, followed by addition of Superscript II and incubation at 42° C. RNA was removed by RNase H and RNase A. The cDNA was purified using the Qiaquick cDNA purification kit (QIAGEN) and fragmented by incubation of up to 5 μg cDNA and 0.2 U DNase I for 10 min at 37° C. in 1× one-phor-all buffer (Amersham-Pharmacia Biotech). The reaction was stopped by incubation for 10 min at 99° C., and fragmentation was confirmed on a 0.7% agarose gel to verify that the average fragment length was 50-100 nt. Fragmented cDNA was 3′-end-labeled with terminal transferase (Roche Molecular Biochemicals) and biotin-N6-ddATP (DuPont/NEN) in 1× TdT buffer (Roche Molecular Biochemicals) containing 2.5 mM cobalt chloride for 2 hours at 37° C.

Hybridization to microarrays and staining procedures were done according to the Affymetrix expression manual (Affymetrix). The arrays were read at 570 nm with a resolution of 3 μm using a laser scanner.

The expression of genes was analyzed using the Affymetrix Microarray Suite 4.01 software program. Detection of transcripts in intergenic regions was done using the intensities of each probe designed to be a perfect match and the corresponding probe designed to be the mismatch. If the perfect match probe showed an intensity that was 200 units higher than the mismatch probe, the probe pair was called positive. Two neighboring positive probe pairs were considered evidence of a transcript. The location and length of the transcripts were estimated based on the first and last identified positive probe pair within an Ig region.
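The probe-pair rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the Microarray Suite procedure itself: probe pairs for one Ig region are given as hypothetical (position, perfect-match intensity, mismatch intensity) tuples, "200 units higher" is read as a difference of at least 200, and "neighboring" is read as adjacent in position order.

```python
from typing import List, Optional, Tuple

def detect_transcript(probe_pairs: List[Tuple[int, float, float]],
                      min_diff: float = 200.0) -> Optional[Tuple[int, int]]:
    """Return (first, last) positions of positive probe pairs in one Ig region,
    or None if no two neighboring probe pairs are positive.

    probe_pairs: hypothetical (position, pm_intensity, mm_intensity) tuples.
    """
    ordered = sorted(probe_pairs)
    # A probe pair is positive when PM exceeds MM by at least min_diff.
    positive = [(pm - mm) >= min_diff for _, pm, mm in ordered]
    # Evidence of a transcript: two adjacent positive probe pairs.
    has_neighbors = any(a and b for a, b in zip(positive, positive[1:]))
    if not has_neighbors:
        return None
    # Extent is estimated from the first and last positive probe pair.
    positions = [pos for (pos, _, _), ok in zip(ordered, positive) if ok]
    return positions[0], positions[-1]
```

The returned positions give only probe start coordinates; converting them to a transcript length would require the probe length, which this sketch leaves out.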

While the present invention has been described in some detail for purposes of clarity and understanding, one skilled in the art will appreciate that various changes in form and detail can be made without departing from the true scope of the invention. All patents, patent applications and publications referred to above are hereby incorporated by reference.

1. An isolated polynucleotide comprising Candidate #14, or its complement, or its homolog having at least about 95% identity thereto.

2. An isolated polynucleotide comprising Candidate #24, or its complement, or its homolog having at least about 95% identity thereto.

3. An isolated polynucleotide comprising Candidate #26, or its complement, or its homolog having at least about 95% identity thereto.

4. An isolated polynucleotide comprising Candidate #31, or its complement, or its homolog having at least about 95% identity thereto.

5. An isolated polynucleotide comprising Candidate #50, or its complement, or its homolog having at least about 95% identity thereto.